mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-05-03 19:26:01 -04:00
b3ebb9b7e92a928344a7a2c1f8514474bfa113cf
1266921 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
b3ebb9b7e9 |
btrfs: stop extent map shrinker if reschedule is needed
The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
68a3ebd18b |
btrfs: use delayed iput during extent map shrinking
When putting an inode during extent map shrinking we're doing a standard
iput() but that may take a long time in case the inode is dirty and we are
doing the final iput that triggers eviction - the VFS will have to wait
for writeback before calling the btrfs evict callback (see
fs/inode.c:evict()).
This slows down the task running the shrinker which may have been
triggered while updating some tree for example, meaning locks are held
as well as an open transaction handle.
Also if the iput() ends up triggering eviction and the inode has no links
anymore, then we trigger item truncation which requires flushing delayed
items, space reservation to start a transaction and that may trigger the
space reclaim task and wait for it, resulting in deadlocks in case the
reclaim task needs for example to commit a transaction and the shrinker
is being triggered from a path holding a transaction handle.
Syzbot reported such a case with the following stack traces:
======================================================
WARNING: possible circular locking dependency detected
6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted
------------------------------------------------------
kswapd0/111 is trying to acquire lock:
ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
but task is already holding lock:
ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
__fs_reclaim_acquire mm/page_alloc.c:3783 [inline]
fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797
might_alloc include/linux/sched/mm.h:334 [inline]
slab_pre_alloc_hook mm/slub.c:3890 [inline]
slab_alloc_node mm/slub.c:3980 [inline]
kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019
btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
alloc_inode+0x5d/0x230 fs/inode.c:261
iget5_locked fs/inode.c:1235 [inline]
iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911
btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114
btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373
__btrfs_balance fs/btrfs/volumes.c:4157 [inline]
btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534
btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline]
btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742
__do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
-> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
btrfs_fill_super fs/btrfs/super.c:946 [inline]
btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
vfs_get_tree+0x8f/0x380 fs/super.c:1780
fc_mount+0x16/0xc0 fs/namespace.c:1125
btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
vfs_get_tree+0x8f/0x380 fs/super.c:1780
do_new_mount fs/namespace.c:3352 [inline]
path_mount+0x6e1/0x1f10 fs/namespace.c:3679
do_mount fs/namespace.c:3692 [inline]
__do_sys_mount fs/namespace.c:3898 [inline]
__se_sys_mount fs/namespace.c:3875 [inline]
__ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
-> #1 (btrfs_trans_num_writers){++++}-{0:0}:
join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314
start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
btrfs_fill_super fs/btrfs/super.c:946 [inline]
btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
vfs_get_tree+0x8f/0x380 fs/super.c:1780
fc_mount+0x16/0xc0 fs/namespace.c:1125
btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
vfs_get_tree+0x8f/0x380 fs/super.c:1780
do_new_mount fs/namespace.c:3352 [inline]
path_mount+0x6e1/0x1f10 fs/namespace.c:3679
do_mount fs/namespace.c:3692 [inline]
__do_sys_mount fs/namespace.c:3898 [inline]
__se_sys_mount fs/namespace.c:3875 [inline]
__ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
-> #0 (sb_internal#3){.+.+}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain kernel/locking/lockdep.c:3869 [inline]
__lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
lock_acquire kernel/locking/lockdep.c:5754 [inline]
lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write include/linux/fs.h:1655 [inline]
sb_start_intwrite include/linux/fs.h:1838 [inline]
start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
evict+0x2ed/0x6c0 fs/inode.c:667
iput_final fs/inode.c:1741 [inline]
iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
iput+0x5c/0x80 fs/inode.c:1757
btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
super_cache_scan+0x409/0x550 fs/super.c:227
do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
shrink_slab+0x18a/0x1310 mm/shrinker.c:662
shrink_one+0x493/0x7c0 mm/vmscan.c:4790
shrink_many mm/vmscan.c:4851 [inline]
lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
shrink_node mm/vmscan.c:5910 [inline]
kswapd_shrink_node mm/vmscan.c:6720 [inline]
balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
kthread+0x2c1/0x3a0 kernel/kthread.c:389
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
other info that might help us debug this:
Chain exists of:
sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(btrfs_trans_num_extwriters);
lock(fs_reclaim);
rlock(sb_internal#3);
*** DEADLOCK ***
2 locks held by kswapd0/111:
#0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
#1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline]
#1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196
stack backtrace:
CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain kernel/locking/lockdep.c:3869 [inline]
__lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
lock_acquire kernel/locking/lockdep.c:5754 [inline]
lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write include/linux/fs.h:1655 [inline]
sb_start_intwrite include/linux/fs.h:1838 [inline]
start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
evict+0x2ed/0x6c0 fs/inode.c:667
iput_final fs/inode.c:1741 [inline]
iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
iput+0x5c/0x80 fs/inode.c:1757
btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
super_cache_scan+0x409/0x550 fs/super.c:227
do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
shrink_slab+0x18a/0x1310 mm/shrinker.c:662
shrink_one+0x493/0x7c0 mm/vmscan.c:4790
shrink_many mm/vmscan.c:4851 [inline]
lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
shrink_node mm/vmscan.c:5910 [inline]
kswapd_shrink_node mm/vmscan.c:6720 [inline]
balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
kthread+0x2c1/0x3a0 kernel/kthread.c:389
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
So fix this by using btrfs_add_delayed_iput() so that the final iput is
delegated to the cleaner kthread.
Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/
Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com
Fixes:
|
||
|
|
a56c85fa2d |
btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
Another improper use of __folio_put() in an error path after freshly
allocating pages/folios which returns them with the refcount initialized
to 1. The refactor from __free_pages() -> __folio_put() (instead of
folio_put) removed a refcount decrement found in __free_pages() and
folio_put but absent from __folio_put().
Fixes:
|
||
|
|
da0386c1c7 |
btrfs: fix folio refcount in btrfs_do_encoded_write()
The conversion to folios switched __free_page() to __folio_put() in the
error path in btrfs_do_encoded_write().
However, this gets the page refcounting wrong. If we do hit that error
path (I reproduced by modifying btrfs_do_encoded_write to pretend to
always fail in a way that jumps to out_folios and running the fstests
case btrfs/281), then we always hit the following BUG freeing the folio:
BUG: Bad page state in process btrfs pfn:40ab0b
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: nonzero _refcount
Call Trace:
<TASK>
dump_stack_lvl+0x3d/0xe0
bad_page+0xea/0xf0
free_unref_page+0x8e1/0x900
? __mem_cgroup_uncharge+0x69/0x90
__folio_put+0xe6/0x190
btrfs_do_encoded_write+0x445/0x780
? current_time+0x25/0xd0
btrfs_do_write_iter+0x2cc/0x4b0
btrfs_ioctl_encoded_write+0x2b6/0x340
It turns out __free_page() decreases the page reference count while
__folio_put() does not. Switch __folio_put() to folio_put() which
decreases the folio reference count first.
Fixes:
|
||
|
|
9da45c88e1 |
btrfs: fix uninitialized return value in the ref-verify tool
In the ref-verify tool, when processing the inline references of an extent
item, we may end up returning with uninitialized return value, because:
1) The 'ret' variable is not initialized if there are no inline extent
references ('ptr' == 'end' before the while loop starts);
2) If we find an extent owner inline reference we don't initialize 'ret'.
So fix these cases by initializing 'ret' to 0 when declaring the variable
and set it to -EINVAL if we find an extent owner inline references and
simple quotas are not enabled (as well as print an error message).
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
724d8042ce |
btrfs: always do the basic checks for btrfs_qgroup_inherit structure
[BUG] Syzbot reports the following regression detected by KASAN: BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171 CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922 btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382 create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875 btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340 btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422 btrfs_ioctl+0x99e/0xc60 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fcbf1992509 RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509 RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660 R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e </TASK> And it also pinned it down to commit |
||
|
|
64d2c847ba |
btrfs: zoned: fix calc_available_free_space() for zoned mode
calc_available_free_space() returns the total size of metadata (or
system) block groups, which can be allocated from unallocated disk
space. The logic is wrong on zoned mode in two places.
First, the calculation of data_chunk_size is wrong. We always allocate
one zone as one chunk, and no partial allocation of a zone. So, we
should use zone_size (= data_sinfo->chunk_size) as it is.
Second, the result "avail" may not be zone aligned. Since we always
allocate one zone as one chunk on zoned mode, returning non-zone size
aligned bytes will result in less pressure on the async metadata reclaim
process.
This is serious for the nearly full state with a large zone size device.
Allowing over-commit too much will result in less async reclaim work and
end up in ENOSPC. We can align down to the zone size to avoid that.
Fixes:
|
||
|
|
48f091fd50 |
btrfs: fix adding block group to a reclaim list and the unused list during reclaim
There is a potential parallel list adding for retrying in
btrfs_reclaim_bgs_work and adding to the unused list. Since the block
group is removed from the reclaim list and it is on a relocation work,
it can be added into the unused list in parallel. When that happens,
adding it to the reclaim list will corrupt the list head and trigger
list corruption like below.
Fix it by taking fs_info->unused_bgs_lock.
[177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
[177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
[177.529][T2585409] ------------[ cut here ]------------
[177.537][T2585409] kernel BUG at lib/list_debug.c:65!
[177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1
[177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
[177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
[177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
[177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
[177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
[177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
[177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
[177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
[177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
[177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
[177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
[177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
[177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
[177.742][T2585409] PKRU: 55555554
[177.748][T2585409] Call Trace:
[177.753][T2585409] <TASK>
[177.759][T2585409] ? __die_body.cold+0x19/0x27
[177.766][T2585409] ? die+0x2e/0x50
[177.772][T2585409] ? do_trap+0x1ea/0x2d0
[177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.788][T2585409] ? do_error_trap+0xa3/0x160
[177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.805][T2585409] ? handle_invalid_op+0x2c/0x40
[177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.820][T2585409] ? exc_invalid_op+0x2d/0x40
[177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20
[177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
safe, AFAICS. Since the block group was in the unused list, the used bytes
should be 0 when it was added to the unused list. Then, it checks
block_group->{used,reserved,pinned} are still 0 under the
block_group->lock. So, they should be still eligible for the unused list,
not the reclaim list.
The reason it is safe there it's because because we're holding
space_info->groups_sem in write mode.
That means no other task can allocate from the block group, so while we
are at deleted_unused_bgs() it's not possible for other tasks to
allocate and deallocate extents from the block group, so it can't be
added to the unused list or the reclaim list by anyone else.
The bug can be reproduced by btrfs/166 after a few rounds. In practice
this can be hit when relocation cannot find more chunk space and ends
with ENOSPC.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Fixes:
|
||
|
|
a7e4c6a303 |
btrfs: qgroup: fix quota root leak after quota disable failure
If during the quota disable we fail when cleaning the quota tree or when
deleting the root from the root tree, we jump to the 'out' label without
ever dropping the reference on the quota root, resulting in a leak of the
root since fs_info->quota_root is no longer pointing to the root (we have
set it to NULL just before those steps).
Fix this by always doing a btrfs_put_root() call under the 'out' label.
This is a problem that exists since qgroups were first added in 2012 by
commit
|
||
|
|
2c49908634 |
btrfs: scrub: handle RST lookup error correctly
[BUG] When running btrfs/060 with forced RST feature, it would crash the following ASSERT() inside scrub_read_endio(): ASSERT(sector_nr < stripe->nr_sectors); Before that, we would have tree dump from btrfs_get_raid_extent_offset(), as we failed to find the RST entry for the range. [CAUSE] Inside scrub_submit_extent_sector_read() every time we allocated a new bbio we immediately called btrfs_map_block() to make sure there was some RST range covering the scrub target. But if btrfs_map_block() fails, we immediately call endio for the bbio, while the bbio is newly allocated, it's completely empty. Then inside scrub_read_endio(), we go through the bvecs to find the sector number (as bi_sector is no longer reliable if the bio is submitted to lower layers). And since the bio is empty, such bvecs iteration would not find any sector matching the sector, and return sector_nr == stripe->nr_sectors, triggering the ASSERT(). [FIX] Instead of calling btrfs_map_block() after allocating a new bbio, call btrfs_map_block() first. Since our only objective of calling btrfs_map_block() is only to update stripe_len, there is really no need to do that after btrfs_alloc_bio(). This new timing would avoid the problem of handling empty bbio completely, and in fact fixes a possible race window for the old code, where if the submission thread is the only owner of the pending_io, the scrub would never finish (since we didn't decrease the pending_io counter). Although the root cause of RST lookup failure still needs to be addressed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
b9fd2affe4 |
btrfs: zoned: fix initial free space detection
When creating a new block group, it calls btrfs_add_new_free_space() to add
the entire block group range into the free space accounting.
__btrfs_add_free_space_zoned() checks if size == block_group->length to
detect the initial free space adding, and proceed that case properly.
However, if the zone_capacity == zone_size and the over-write speed is fast
enough, the entire zone can be over-written within one transaction. That
confuses __btrfs_add_free_space_zoned() to handle it as an initial free
space accounting. As a result, that block group becomes a strange state: 0
used bytes, 0 zone_unusable bytes, but alloc_offset == zone_capacity (no
allocation anymore).
The initial free space accounting can properly be checked by checking
alloc_offset too.
Fixes:
|
||
|
|
d1825752e3 |
btrfs: use NOFS context when getting inodes during logging and log replay
During inode logging (and log replay too), we are holding a transaction
handle and we often need to call btrfs_iget(), which will read an inode
from its subvolume btree if it's not loaded in memory and that results in
allocating an inode with GFP_KERNEL semantics at the btrfs_alloc_inode()
callback - and this may recurse into the filesystem in case we are under
memory pressure and attempt to commit the current transaction, resulting
in a deadlock since the logging (or log replay) task is holding a
transaction handle open.
Syzbot reported this with the following stack traces:
WARNING: possible circular locking dependency detected
6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Not tainted
------------------------------------------------------
syz-executor.1/9919 is trying to acquire lock:
ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:334 [inline]
ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:3891 [inline]
ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:3981 [inline]
ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
but task is already holding lock:
ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&ei->log_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x175/0x9c0 kernel/locking/mutex.c:752
btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
btrfs_log_inode_parent+0x8cb/0x2a90 fs/btrfs/tree-log.c:7079
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
vfs_fsync_range+0x141/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2794 [inline]
btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
new_sync_write fs/read_write.c:497 [inline]
vfs_write+0x6b6/0x1140 fs/read_write.c:590
ksys_write+0x12f/0x260 fs/read_write.c:643
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
-> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
btrfs_commit_super+0xa1/0x110 fs/btrfs/disk-io.c:4170
close_ctree+0xcb0/0xf90 fs/btrfs/disk-io.c:4324
generic_shutdown_super+0x159/0x3d0 fs/super.c:642
kill_anon_super+0x3a/0x60 fs/super.c:1226
btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2096
deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
deactivate_super+0xde/0x100 fs/super.c:506
cleanup_mnt+0x222/0x450 fs/namespace.c:1267
task_work_run+0x14e/0x250 kernel/task_work.c:180
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x278/0x2a0 kernel/entry/common.c:218
__do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:389
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
-> #1 (btrfs_trans_num_writers){++++}-{0:0}:
__lock_release kernel/locking/lockdep.c:5468 [inline]
lock_release+0x33e/0x6c0 kernel/locking/lockdep.c:5774
percpu_up_read include/linux/percpu-rwsem.h:99 [inline]
__sb_end_write include/linux/fs.h:1650 [inline]
sb_end_intwrite include/linux/fs.h:1767 [inline]
__btrfs_end_transaction+0x5ca/0x920 fs/btrfs/transaction.c:1071
btrfs_commit_inode_delayed_inode+0x228/0x330 fs/btrfs/delayed-inode.c:1301
btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
evict+0x2ed/0x6c0 fs/inode.c:667
iput_final fs/inode.c:1741 [inline]
iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
iput+0x5c/0x80 fs/inode.c:1757
dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
__dentry_kill+0x1d0/0x600 fs/dcache.c:603
dput.part.0+0x4b1/0x9b0 fs/dcache.c:845
dput+0x1f/0x30 fs/dcache.c:835
ovl_stack_put+0x60/0x90 fs/overlayfs/util.c:132
ovl_destroy_inode+0xc6/0x190 fs/overlayfs/super.c:182
destroy_inode+0xc4/0x1b0 fs/inode.c:311
iput_final fs/inode.c:1741 [inline]
iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
iput+0x5c/0x80 fs/inode.c:1757
dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
__dentry_kill+0x1d0/0x600 fs/dcache.c:603
shrink_kill fs/dcache.c:1048 [inline]
shrink_dentry_list+0x140/0x5d0 fs/dcache.c:1075
prune_dcache_sb+0xeb/0x150 fs/dcache.c:1156
super_cache_scan+0x32a/0x550 fs/super.c:221
do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
shrink_slab_memcg mm/shrinker.c:548 [inline]
shrink_slab+0xa87/0x1310 mm/shrinker.c:626
shrink_one+0x493/0x7c0 mm/vmscan.c:4790
shrink_many mm/vmscan.c:4851 [inline]
lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
shrink_node mm/vmscan.c:5910 [inline]
kswapd_shrink_node mm/vmscan.c:6720 [inline]
balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
kthread+0x2c1/0x3a0 kernel/kthread.c:389
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain kernel/locking/lockdep.c:3869 [inline]
__lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
lock_acquire kernel/locking/lockdep.c:5754 [inline]
lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
__fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
might_alloc include/linux/sched/mm.h:334 [inline]
slab_pre_alloc_hook mm/slub.c:3891 [inline]
slab_alloc_node mm/slub.c:3981 [inline]
kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
alloc_inode+0x5d/0x230 fs/inode.c:261
iget5_locked fs/inode.c:1235 [inline]
iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
vfs_fsync_range+0x141/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2794 [inline]
btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
vfs_writev+0x36f/0xde0 fs/read_write.c:971
do_pwritev+0x1b2/0x260 fs/read_write.c:1072
__do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
__se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
__ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
other info that might help us debug this:
Chain exists of:
fs_reclaim --> btrfs_trans_num_extwriters --> &ei->log_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ei->log_mutex);
lock(btrfs_trans_num_extwriters);
lock(&ei->log_mutex);
lock(fs_reclaim);
*** DEADLOCK ***
7 locks held by syz-executor.1/9919:
#0: ffff88802be20420 (sb_writers#23){.+.+}-{0:0}, at: do_pwritev+0x1b2/0x260 fs/read_write.c:1072
#1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: inode_lock include/linux/fs.h:791 [inline]
#1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: btrfs_inode_lock+0xc8/0x110 fs/btrfs/inode.c:385
#2: ffff888065c0f778 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0xee/0x110 fs/btrfs/inode.c:388
#3: ffff88802be20610 (sb_internal#4){.+.+}-{0:0}, at: btrfs_sync_file+0x95b/0xe10 fs/btrfs/file.c:1952
#4: ffff8880546323f0 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
#5: ffff888054632418 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
#6: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
stack backtrace:
CPU: 2 PID: 9919 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain kernel/locking/lockdep.c:3869 [inline]
__lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
lock_acquire kernel/locking/lockdep.c:5754 [inline]
lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
__fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
might_alloc include/linux/sched/mm.h:334 [inline]
slab_pre_alloc_hook mm/slub.c:3891 [inline]
slab_alloc_node mm/slub.c:3981 [inline]
kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
alloc_inode+0x5d/0x230 fs/inode.c:261
iget5_locked fs/inode.c:1235 [inline]
iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
vfs_fsync_range+0x141/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2794 [inline]
btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
vfs_writev+0x36f/0xde0 fs/read_write.c:971
do_pwritev+0x1b2/0x260 fs/read_write.c:1072
__do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
__se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
__ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
RIP: 0023:0xf7334579
Code: b8 01 10 06 03 (...)
RSP: 002b:00000000f5f265ac EFLAGS: 00000292 ORIG_RAX: 000000000000017b
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000200002c0
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Fix this by ensuring we are under a NOFS scope whenever we call
btrfs_iget() during inode logging and log replay.
Reported-by: syzbot+8576cfa84070dce4d59b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000274a3a061abbd928@google.com/
Fixes:
|
||
|
|
cebae292e0 |
btrfs: zoned: allocate dummy checksums for zoned NODATASUM writes
Shin'ichiro reported that when he's running fstests' test-case btrfs/167 on emulated zoned devices, he's seeing the following NULL pointer dereference in 'btrfs_zone_finish_endio()': Oops: general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f] CPU: 4 PID: |
||
|
|
4eb4e85c4f |
btrfs: retry block group reclaim without infinite loop
If inc_block_group_ro systematically fails (e.g. due to ETXTBUSY from
swap) or btrfs_relocate_chunk systematically fails (from lack of
space), then this worker becomes an infinite loop.
At the very least, this strands the cleaner thread, but can also result
in hung tasks/RCU stalls on PREEMPT_NONE kernels and if the
reclaim_bgs_lock mutex is not contended.
I believe the best long term fix is to manage reclaim via work queue,
where we queue up a relocation on the triggering condition and re-queue
on failure. In the meantime, this is an easy fix to apply to avoid the
immediate pain.
Fixes:
|
||
|
|
f3a5367c67 |
btrfs: protect folio::private when attaching extent buffer folios
[BUG] Since v6.8 there are rare kernel crashes reported by various people, the common factor is bad page status error messages like this: BUG: Bad page state in process kswapd0 pfn:d6e840 page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c pfn:0xd6e840 aops:btree_aops ino:1 flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff) page_type: 0xffffffff() raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0 raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: non-NULL mapping [CAUSE] Commit |
||
|
|
fb33eb2ef0 |
btrfs: fix leak of qgroup extent records after transaction abort
Qgroup extent records are created when delayed ref heads are created and
then released after accounting extents at btrfs_qgroup_account_extents(),
called during the transaction commit path.
If a transaction is aborted we free the qgroup records by calling
btrfs_qgroup_destroy_extent_records() at btrfs_destroy_delayed_refs(),
unless we don't have delayed references. We are incorrectly assuming
that no delayed references means we don't have qgroup extents records.
We can currently have no delayed references because we ran them all
during a transaction commit and the transaction was aborted after that
due to some error in the commit path.
So fix this by ensuring we btrfs_qgroup_destroy_extent_records() at
btrfs_destroy_delayed_refs() even if we don't have any delayed references.
Reported-by: syzbot+0fecc032fa134afd49df@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/0000000000004e7f980619f91835@google.com/
Fixes:
|
||
|
|
9d274c19a7 |
btrfs: fix crash on racing fsync and size-extending write into prealloc
We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe(): BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2620! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] With the following stack trace: #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) #1 btrfs_drop_extents (fs/btrfs/file.c:411:4) #2 log_one_extent (fs/btrfs/tree-log.c:4732:9) #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) #7 btrfs_sync_file (fs/btrfs/file.c:1933:8) #8 vfs_fsync_range (fs/sync.c:188:9) #9 vfs_fsync (fs/sync.c:202:9) #10 do_fsync (fs/sync.c:212:9) #11 __do_sys_fdatasync (fs/sync.c:225:9) #12 __se_sys_fdatasync (fs/sync.c:223:1) #13 __x64_sys_fdatasync (fs/sync.c:223:1) #14 do_syscall_x64 (arch/x86/entry/common.c:52:14) #15 do_syscall_64 (arch/x86/entry/common.c:83:7) #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG(). This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us: >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 leaf 33439744 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 9 size 8192 nbytes 8473563889606862198 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 204 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417704.983333333 (2024-05-22 15:41:44) mtime 1716417704.983333333 (2024-05-22 15:41:44) otime 17592186044416.000000000 (559444-03-08 01:40:16) item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 index 195 namelen 3 name: 193 item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 4096 ram 12288 extent compression 0 (none) item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 4096 nr 8192 item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 ... So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size. Here is the state of the filesystem tree at the time of the crash: >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) >>> print_extent_buffer(nodes[0]) leaf 30425088 level 0 items 184 generation 9 owner 5 leaf 30425088 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da ... item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 generation 7 transid 7 size 4096 nbytes 12288 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 6 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417703.220000000 (2024-05-22 15:41:43) mtime 1716417703.220000000 (2024-05-22 15:41:43) otime 1716417703.220000000 (2024-05-22 15:41:43) item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 index 195 namelen 3 name: 193 item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 8192 ram 12288 extent compression 0 (none) item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf. btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree. If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search. However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree. This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events: - Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf. - The file is fsync'd, which copies its inode item and both extent items to the log tree. - An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag. - The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight. - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents(). - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf(). - btrfs_next_leaf() unlocks the path. - The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k. - btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k. - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync. - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written. - This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k. - btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG(). Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f13e01b89d |
btrfs: ensure fast fsync waits for ordered extents after a write failure
If a write path in COW mode fails, either before submitting a bio for the new extents or an actual IO error happens, we can end up allowing a fast fsync to log file extent items that point to unwritten extents. This is because dropping the extent maps happens when completing ordered extents, at btrfs_finish_one_ordered(), and the completion of an ordered extent is executed in a work queue. This can result in a fast fsync to start logging file extent items based on existing extent maps before the ordered extents complete, therefore resulting in a log that has file extent items that point to unwritten extents, resulting in a corrupt file if a crash happens after and the log tree is replayed the next time the fs is mounted. This can happen for both direct IO writes and buffered writes. For example consider a direct IO write, in COW mode, that fails at btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an error: 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter set to false, meaning an error happened; 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR flag; 3) btrfs_finish_ordered_extent() queues the completion of the ordered extent - so that btrfs_finish_one_ordered() will be executed later in a work queue. That function will drop extent maps in the range when it's executed, since the extent maps point to unwritten locations (signaled by the BTRFS_ORDERED_IOERR flag); 4) After calling btrfs_finish_ordered_extent() we keep going down the write path and unlock the inode; 5) After that a fast fsync starts and locks the inode; 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync task sees the extent maps that point to the unwritten locations and logs file extent items based on them - it does not know they are unwritten, and the fast fsync path does not wait for ordered extents to complete, which is an intentional behaviour in order to reduce latency. For the buffered write case, here's one example: 1) A fast fsync begins, and it starts by flushing delalloc and waiting for the writeback to complete by calling filemap_fdatawait_range(); 2) Flushing the dellaloc created a new extent map X; 3) During the writeback some IO error happened, and at the end io callback (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its completion; 4) After queuing the ordered extent completion, the end io callback clears the writeback flag from all pages (or folios), and from that moment the fast fsync can proceed; 5) The fast fsync proceeds sees extent map X and logs a file extent item based on extent map X, resulting in a log that points to an unwritten data extent - because the ordered extent completion hasn't run yet, it happens only after the logging. To fix this make btrfs_finish_ordered_extent() set the inode flag BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write, so that a fast fsync will wait for ordered extent completion. Note that this issues of using extent maps that point to unwritten locations can not happen for reads, because in read paths we start by locking the extent range and wait for any ordered extents in the range to complete before looking for extent maps. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
440861b1a0 |
btrfs: re-introduce 'norecovery' mount option
Although 'norecovery' mount option was marked as deprecated for a long
time and a warning message was printed during the deprecation window,
it's still actively utilized by several projects that need a safer way
to mount a btrfs without any writes.
Furthermore this 'norecovery' mount option is supported by other major
filesystems, which makes it less clear what's our motivation to remove
it.
Re-introduce the 'norecovery' mount option, and output a message to recommend
'rescue=nologreplay' option.
Link: https://lore.kernel.org/linux-btrfs/ZkxZT0J-z0GYvfy8@gardel-login/#t
Link: https://github.com/systemd/systemd/pull/32892
Link: https://bugzilla.suse.com/show_bug.cgi?id=1222429
Reported-by: Lennart Poettering <lennart@poettering.net>
Reported-by: Jiri Slaby <jslaby@suse.com>
Fixes:
|
||
|
|
dddff821b6 |
btrfs: fix end of tree detection when searching for data extent ref
At lookup_extent_data_ref() we are incorrectly checking if we are at the
last slot of the last leaf in the extent tree. We are returning -ENOENT
if btrfs_next_leaf() returns a value greater than 1, but btrfs_next_leaf()
never returns anything greater than 1:
1) It returns < 0 on error;
2) 0 if there is a next leaf (or a new item was added to the end of the
current leaf after releasing the path);
3) 1 if there are no more leaves (and no new items were added to the last
leaf after releasing the path).
So fix this by checking if the return value is greater than zero instead
of being greater than one.
Fixes:
|
||
|
|
b4e585fffc |
btrfs: scrub: initialize ret in scrub_simple_mirror() to fix compilation warning
The following error message is displayed: ../fs/btrfs/scrub.c:2152:9: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]" Compiler version: gcc version: (Debian 10.2.1-6) 10.2.1 20210110 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Lu Yao <yaolu@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0090d6e1b2 |
btrfs: zoned: fix use-after-free due to race with dev replace
While loading a zone's info during creation of a block group, we can race with a device replace operation and then trigger a use-after-free on the device that was just replaced (source device of the replace operation). This happens because at btrfs_load_zone_info() we extract a device from the chunk map into a local variable and then use the device while not under the protection of the device replace rwsem. So if there's a device replace operation happening when we extract the device and that device is the source of the replace operation, we will trigger a use-after-free if before we finish using the device the replace operation finishes and frees the device. Fix this by enlarging the critical section under the protection of the device replace rwsem so that all uses of the device are done inside the critical section. CC: stable@vger.kernel.org # 6.1.x: |
||
|
|
2b8aa78cf1 |
btrfs: qgroup: fix qgroup id collision across mounts
If we delete subvolumes whose ID is the largest in the filesystem, then
unmount and mount again, then btrfs_init_root_free_objectid on the
tree_root will select a subvolid smaller than that one and thus allow
reusing it.
If we are also using qgroups (and particularly squotas) it is possible
to delete the subvol without deleting the qgroup. In that case, we will
be able to create a new subvol whose id already has a level 0 qgroup.
This will result in re-using that qgroup which would then lead to
incorrect accounting.
Fixes:
|
||
|
|
1fa7603d56 |
btrfs: qgroup: update rescan message levels and error codes
On filesystems without enabled quotas there's still a warning message in the logs when rescan is called. In that case it's not a problem that should be reported, rescan can be called unconditionally. Change the error code to ENOTCONN which is used for 'quotas not enabled' elsewhere. Remove message (also a warning) when rescan is called during an ongoing rescan, this brings no useful information and the error code is sufficient. Change message levels to debug for now, they can be removed eventually. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0e39c9e524 |
btrfs: qgroup: fix initialization of auto inherit array
The "i++" was accidentally left out so it just sets qgids[0] over and
over.
This can lead to unexpected problems, as the groups[1:] would be all 0,
leading to later find_qgroup_rb() unable to find a qgroup and cause
snapshot creation failure.
Fixes:
|
||
|
|
bc00965dbf |
btrfs: count super block write errors in device instead of tracking folio error state
Currently the error status of super block write is tracked in page/folio status bit Error. For that we need to keep the reference for the whole duration of write and wait. Count the number of superblock writeback errors in the btrfs_device. That means we don't need the folio to stay around until it's waited for, and can avoid the extra call to folio_get/put. Also remove a mention of PageError in a comment as it's the last mention of the page Error state. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
617fb10ea8 |
btrfs: use the folio iterator in btrfs_end_super_write()
Iterate over folios instead of bvecs. Switch the order of unlock and put to be the usual order; we know this folio can't be put until it's been waited for, but that's fragile. Remove the calls to ClearPageUptodate / SetPageUptodate -- if PAGE_SIZE is larger than BTRFS_SUPER_INFO_SIZE, we'd be marking the entire folio uptodate without having actually initialised all the bytes in the page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f93ee0df51 |
btrfs: convert super block writes to folio in write_dev_supers()
This is a direct conversion from pages to folios, assuming single page folio. Also removes some calls to obsolete APIs and some hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
c94b7349b8 |
btrfs: convert super block writes to folio in wait_dev_supers()
This is a direct conversion from pages to folios, assuming single page folio. Also removes a few calls to compound_head() and calls to obsolete APIs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
8fde439b2d |
bio: Export bio_add_folio_nofail to modules
Several modules use __bio_add_page() today and may need to be converted to bio_add_folio_nofail(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
58a774ca16 |
btrfs: remove duplicate included header from fs.h
Remove duplicate included header file linux/blkdev.h . Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
6b0a63a4fa |
btrfs: add a cached state to extent_clear_unlock_delalloc
Now that we have the lock_extent tightly coupled with extent_clear_unlock_delalloc we can add a cached state to extent_clear_unlock_delalloc and benefit from skipping the extra lookup when we're doing cow. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
8325f41a56 |
btrfs: push extent lock down in submit_one_async_extent
We don't need to include the time we spend in the allocator under our extent lock protection, move it after the allocator and make sure we lock the extent in the error case to ensure we're not clearing these bits without the extent lock held. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
d456c25dbb |
btrfs: push lock_extent down in cow_file_range()
Now that we've got the extent lock pushed into cow_file_range() we can push it further down into the allocation loop. This allows us to only hold the extent lock during the dropping of the extent map range and inserting the ordered extent. This makes the error case a little trickier as we'll now have to lock the range before clearing any of the other extent bits for the range, but this is the error path so is less performance critical. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
cd241a8f55 |
btrfs: move can_cow_file_range_inline() outside of the extent lock
These checks aren't reliant on the extent lock. Move this up into cow_file_range_inline(), and then update encoded writes to call this check before calling __cow_file_range_inline(). This will allow us to skip the extent lock if we're not able to inline the given extent. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0ab540995a |
btrfs: push lock_extent into cow_file_range_inline
Now that we've pushed the lock_extent() into cow_file_range() we can push the extent locking into cow_file_range_inline() and move the lock_extent in cow_file_range() to after we call cow_file_range_inline(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
a0766d8f35 |
btrfs: push extent lock into cow_file_range
Now that cow_file_range is the only function that is called with the range locked, push this call into cow_file_range so we can further narrow the scope. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
00009d7bcb |
btrfs: push extent lock into run_delalloc_cow
This is used by zoned but also as the fallback for uncompressed extents when we fail to compress the ranges. Push the extent lock into run_dealloc_cow(), and adjust the compression case to take the extent lock after calling run_delalloc_cow(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0e128d4e41 |
btrfs: remove unlock_extent from run_delalloc_compressed
Since we immediately unlock the extent range when we enter run_delalloc_compressed() simply move the lock_extent() down to cover cow_file_range() and then remove the unlock_extent() from run_delalloc_compressed. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
aa56b0aa91 |
btrfs: push extent lock down in run_delalloc_nocow
run_delalloc_nocow is a little special because we use the file extents to see if we can nocow a range. We don't actually need the protection of the extent lock to look at the file extents at this point however. We are currently holding the page lock for this range, so we are protected from anybody who would simultaneously be modifying the file extent items for this range. * mmap() - we're holding the page lock. * buffered writes - we're holding the page lock. * direct writes - we're holding the page lock and direct IO has to flush page cache before it's able to continue. * fallocate() - all callers flush the range and wait on ordered extents while holding the inode lock and the mmap lock, so we are again saved by the page lock. We want to use the extent lock to protect 1) The mapping tree for the given range. 2) The ordered extents for the given range. 3) The io_tree for the given range. Push the extent lock down to cover these operations. In the fallback_to_cow() case we simply lock before doing anything and rely on the cow_file_range() helper to handle it's range properly. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0ed30c17f6 |
btrfs: adjust while loop condition in run_delalloc_nocow
We have the following pattern
while (1) {
if (cur_offset > end)
break;
}
Which is just
while (cur_offset <= end) {
...
}
so adjust the code to be more clear.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
7c9acd440f |
btrfs: push extent lock into run_delalloc_nocow
run_delalloc_nocow is a bit special as it walks through the file extents for the inode and determines what it can nocow and what it can't. This is the more complicated area for extent locking, so start with this function. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
c0707c9e1e |
btrfs: push the extent lock into btrfs_run_delalloc_range
We want to limit the scope of the extent lock to be around operations that can change in flight. Currently we hold the extent lock through the entire writepage operation, which isn't really necessary. We want to protect to make sure nobody has updated DELALLOC. In find_lock_delalloc_range we must lock the range in order to validate the contents of our io_tree. However once we've done that we're safe to unlock the range and continue, as we have the page lock already held for the range. We are protected from all operations at this point. * mmap() - we're holding the page lock, thus are protected. * buffered writes - again, we're protected because we take the page lock for the first and last page in our range for buffered writes so we won't create new delalloc ranges in this area. * direct IO - we invalidate pagecache before attempting to write a new area, which requires the page lock, so again are protected once we're holding the page lock on this range. Additionally this behavior actually already exists for compressed, we unlock the range as soon as we start to process the async extents, and re-lock it during compression. So this is completely safe, and makes the locking more consistent. Make this simple by just pushing the extent lock into btrfs_run_delalloc_range. From there followup patches will push the lock further down into its users. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
7034674b8a |
btrfs: lock extent when doing inline extent in compression
We currently don't lock the extent when we're doing a cow_file_range_inline() for a compressed extent. This isn't a problem necessarily, but it's inconsistent with the rest of our usage of cow_file_range_inline(). This also leads to some extra weird logic around whether the extent is locked or not. Fix this to lock the extent before calling cow_file_range_inline() in compression to make it consistent with the rest of the inline users. In future patches this will be pushed down into the cow_file_range_inline() helper, so we're fine with the quick and dirty locking here. This patch exists to make the behavior change obvious. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0586d0a89e |
btrfs: move extent bit and page cleanup into cow_file_range_inline
We duplicate the extent cleanup for cow_file_range_inline() in the cow and compressed case. The encoded case doesn't need to do cleanup the same way, so rename cow_file_range_inline to __cow_file_range_inline and then make cow_file_range_inline handle the extent cleanup appropriately, and update the callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0332967b4d |
btrfs: unlock all the pages with successful inline extent creation
Since
|
||
|
|
6eecfa2240 |
btrfs: push all inline logic into cow_file_range
Currently we have a lot of duplicated checks of if (start == 0 && fs_info->sectorsize == PAGE_SIZE) cow_file_range_inline(); Instead of duplicating this check everywhere, consolidate all of the inline extent logic into a helper which documents all of the checks and then use that helper inside of cow_file_range_inline(). With this we can clean up all of the calls to either unconditionally call cow_file_range_inline(), or at least reduce the checks we're doing before we call cow_file_range_inline(); Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
aa5ccf2917 |
btrfs: handle errors in btrfs_reloc_clone_csums properly
In the cow path we will clone the reloc csums for relocated data extents, and if there's an error we already have an ordered extent and rely on the ordered extent finishing to clean everything up. There's a problem however, we don't mark the ordered extent with an error, we pretend like everything was just fine. If we were at the end of our range we won't actually bubble up this error anywhere, and we could end up inserting an extent that doesn't have csums where it should have them. Fix this by adding a helper to mark the ordered extent with an error, and then use this when we fail to lookup the csums in btrfs_reloc_clone_csums. Use this helper in the other place where we use the same pattern while we're here. This will prevent us from erroneously inserting the extent that doesn't have the required checksums. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
e98bf64f7a |
btrfs: add extra sanity checks for create_io_em()
The function create_io_em() is called before we submit an IO, to update
the in-memory extent map for the involved range.
This patch changes the following aspects:
- Does not allow BTRFS_ORDERED_NOCOW type
For real NOCOW (excluding NOCOW writes into preallocated ranges)
writes, we never call create_io_em(), as we does not need to update
the extent map at all.
So remove the sanity check allowing BTRFS_ORDERED_NOCOW type.
- Add extra sanity checks
* PREALLOC
- @block_len == len
For uncompressed writes.
* REGULAR
- @block_len == @orig_block_len == @ram_bytes == @len
We're creating a new uncompressed extent, and referring all of it.
- @orig_start == @start
We haven no offset inside the extent.
* COMPRESSED
- valid @compress_type
- @len <= @ram_bytes
This is to co-operate with encoded writes, which can cause a new
file extent referring only part of a uncompressed extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
4bdc558bf9 |
btrfs: simplify the inline extent map creation
With the tree-checker ensuring all inline file extents starts at file offset 0 and has a length no larger than sectorsize, we can simplify the calculation to assigned those fixes values directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |