Commit Graph

4867 Commits

Author SHA1 Message Date
Kent Overstreet
8e4d28036c bcachefs: Don't aggressively discard the journal
We frequently use 'bcachefs list_journal -a' for debugging, as it
provides a record of all btree transactions, and a history of what
happened.

But it's not so useful if we immediately discard journal buckets right
after they're no longer dirty.

This tweaks journal reclaim to only discard when we're low on space,
keeping the journal mostly un-discarded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 17:10:10 -04:00
Kent Overstreet
da18dabc37 bcachefs: Ensure superblock gets written when we go ERO
When we go emergency read-only, make sure we do a final write_super() to
persist counters and error counts - this can be critical for piecing
together what fsck was doing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 17:09:59 -04:00
Kent Overstreet
2fea3aa76e bcachefs: Filter out harmless EROFS error messages
These just indicate that we're shutting down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 16:58:32 -04:00
Kent Overstreet
473f09f362 bcachefs: journal_shutdown is EROFS, not EIO
We often filter out EROFS errors to avoid log spew after an emergency
shutdown - journal_shutdown is just another emergency shutdown error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 16:58:26 -04:00
Kent Overstreet
9c61856099 bcachefs: Call bch2_fs_start before getting vfs superblock
This reverts

1fdbe0b184 bcachefs: Make sure c->vfs_sb is set before starting fs

switched up bch2_fs_get_tree() so that we got a superblock before
calling bch2_fs_start, so that c->vfs_sb would always be initialized
while the filesystem was active.

This turned out not to be necessary, because blk_holder_ops were
implemented using our own locking, not vfs locking.

And this had the side effect of creating a super_block and doing our
full recovery (including potentially fsck) before setting SB_BORN, which
causes things like sync calls to hang until our recovery is finished.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 16:06:35 -04:00
Kent Overstreet
aed4ccbf45 bcachefs: fix hung task timeout in journal read
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:21:28 -04:00
Kent Overstreet
7a69fa6571 bcachefs: Add missing barriers before wake_up_bit()
wake_up() doesn't require a barrier - but wake_up_bit() does.

This only affected non x86, and primarily lead to lost wakeups after
btree node reads.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:19:10 -04:00
Kent Overstreet
50a7b899a0 bcachefs: Ensure proper write alignment
There was a buggy version of bcachefs-tools which picked misaligned
bucket sizes when formatting, and we're also about to do dynamic block
sizes - which will allow picking logical block size or physical block
size of the device per-write, allowing for better compression ratios at
the cost of slightly worse write performance (i.e. forcing the device to
do RMW or extra buffering).

To account for this, tweak bch2_alloc_sectors_start() to properly align
open_buckets to the blocksize of the write we're about to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:19:01 -04:00
Kent Overstreet
844f766e02 bcachefs: Improve want_cached_ptr()
If promote target isn't set, rebalance should still leave a cached copy
on the faster device.

Fall back to foreground_target if it's set, or allow a cached copy on
any device if neither are set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:16:20 -04:00
Kent Overstreet
df2e19a883 bcachefs: thread_with_stdio: fix spinning instead of exiting
bch2_stdio_redirect_vprintf() was missing a check for stdio->done, i.e.
exiting.

This caused the thread attempting to print to spin, and since it was
being called from the kthread ran by thread_with_stdio, the userspace
side hung as well.

Change it to return -EPIPE - i.e. writing to a pipe that's been closed.

Reported-by: Jan Solanti <jhs@psonet.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-04 14:00:14 -04:00
Alan Huang
6846100b00 bcachefs: Remove incorrect __counted_by annotation
This actually reverts 86e92eeeb2 ("bcachefs: Annotate struct bch_xattr
with __counted_by()").

After the x_name, there is a value. According to the disscussion[1],
__counted_by assumes that the flexible array member contains exactly
the amount of elements that are specified. Now there are users came across
a false positive detection of an out of bounds write caused by
the __counted_by here[2], so revert that.

[1] https://lore.kernel.org/lkml/Zv8VDKWN1GzLRT-_@archlinux/T/#m0ce9541c5070146320efd4f928cc1ff8de69e9b2
[2] https://privatebin.net/?a0d4e97d590d71e1#9bLmp2Kb5NU6X6cZEucchDcu88HzUQwHUah8okKPReEt

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 16:38:58 -04:00
Kent Overstreet
28580052e6 bcachefs: add missing sched_annotate_sleep()
00594 ------------[ cut here ]------------
00594 do not call blocking ops when !TASK_RUNNING; state=2 set at [<000000003e51ef4a>] prepare_to_wait_event+0x5c/0x1c0
00594 WARNING: CPU: 12 PID: 1117 at kernel/sched/core.c:8741 __might_sleep+0x74/0x88
00594 Modules linked in:
00594 CPU: 12 UID: 0 PID: 1117 Comm: umount Not tainted 6.15.0-rc4-ktest-g3a72e369412d #21845 PREEMPT
00594 Hardware name: linux,dummy-virt (DT)
00594 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00594 pc : __might_sleep+0x74/0x88
00594 lr : __might_sleep+0x74/0x88
00594 sp : ffffff80c8d67a90
00594 x29: ffffff80c8d67a90 x28: ffffff80f5903500 x27: 0000000000000000
00594 x26: 0000000000000000 x25: ffffff80cf5002a0 x24: ffffffc087dad000
00594 x23: ffffff80c8d67b40 x22: 0000000000000000 x21: 0000000000000000
00594 x20: 0000000000000242 x19: ffffffc080b92020 x18: 00000000ffffffff
00594 x17: 30303c5b20746120 x16: 74657320323d6574 x15: 617473203b474e49
00594 x14: 0000000000000001 x13: 00000000000c0000 x12: ffffff80facc0000
00594 x11: 0000000000000001 x10: 0000000000000001 x9 : ffffffc0800b0774
00594 x8 : c0000000fffbffff x7 : ffffffc087dac670 x6 : 00000000015fffa8
00594 x5 : ffffff80facbffa8 x4 : ffffff80fbd30b90 x3 : 0000000000000000
00594 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80f5903500
00594 Call trace:
00594  __might_sleep+0x74/0x88 (P)
00594  __mutex_lock+0x64/0x8d8
00594  mutex_lock_nested+0x28/0x38
00594  bch2_fs_ec_flush+0xf8/0x128
00594  __bch2_fs_read_only+0x54/0x1d8
00594  bch2_fs_read_only+0x3e0/0x438
00594  __bch2_fs_stop+0x5c/0x250
00594  bch2_put_super+0x18/0x28
00594  generic_shutdown_super+0x6c/0x140
00594  bch2_kill_sb+0x1c/0x38
00594  deactivate_locked_super+0x54/0xd0
00594  deactivate_super+0x70/0x90
00594  cleanup_mnt+0xec/0x188
00594  __cleanup_mnt+0x18/0x28
00594  task_work_run+0x90/0xd8
00594  do_notify_resume+0x138/0x148
00594  el0_svc+0x9c/0xa0
00594  el0t_64_sync_handler+0x104/0x130
00594  el0t_64_sync+0x154/0x158

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 13:54:58 -04:00
Kent Overstreet
e2699274d5 bcachefs: Fix __bch2_dev_group_set()
bch2_sb_disk_groups_to_cpu() goes off of the superblock member info, so
we need to set that first.

Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 12:22:10 -04:00
Kent Overstreet
e660d7ca74 bcachefs: Kill ERO for i_blocks check in truncate
Replace with logging the error in the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 06:19:58 -04:00
Kent Overstreet
3a72e36941 bcachefs: check for inode.bi_sectors underflow
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 06:19:58 -04:00
Kent Overstreet
05450c48a3 bcachefs: Kill ERO in __bch2_i_sectors_acct()
We won't be root causing this in the immediate future, and it's fairly
innocuous - so just log it in the superblock.

https://github.com/koverstreet/bcachefs/issues/869

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 06:19:58 -04:00
Kent Overstreet
5e63d579e7 bcachefs: readdir fixes
- Don't call bch2_trans_relock() after dir_emit(); taking a transaction
  restart here will cause us to emit the same dirent to userspace twice

- Fix incorrect checking of the return value on dir_emit(): "true" means
  success, keep going, but bch2_dir_emit() needs to return true when
  we're finished iterating.

https://github.com/koverstreet/bcachefs/issues/867

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-30 11:49:34 -04:00
Kent Overstreet
2feaa92c7c bcachefs: improve missing journal write device error message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-30 11:49:28 -04:00
Kent Overstreet
dbe4674802 bcachefs: Topology error after insert is now an ERO
A user hit this, and this will naturally be easier to debug if we don't
panic.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
9a4a858c9b bcachefs: Use bch2_kvmalloc() for journal keys array
We can hit this limit fairly easy when we have to reconstuct large
amounts of alloc info on large filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
e5a3b8cf33 bcachefs: More informative error message when shutting down due to error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
652dd6558b bcachefs: btree_root_unreadable_and_scan_found_nothing autofix for non data btrees
If loosing a btree won't cause data loss - i.e. it's an alloc btree, or
we can easily reconstruct it - we shouldn't require user action to
continue repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
c366b1672d bcachefs: btree_node_data_missing is now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
eca5b56ccf bcachefs: Don't generate alloc updates to invalid buckets
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
e7f1a52849 bcachefs: Improve bch2_dev_bucket_missing()
More useful error message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
002466446a bcachefs: fix bch2_dev_buckets_resize()
The resize memcpy path was totally busted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
9e9c28acfd bcachefs: Add upgrade table entry from 0.14
There are a few errors that needed to be marked as autofix.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
3c24020119 bcachefs: Run BCH_RECOVERY_PASS_reconstruct_snapshots on missing subvol -> snapshot
Fix this repair path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
bdc32a10a2 bcachefs: Add missing utf8_unload()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
70c3d89f49 bcachefs: Emit unicode version message on startup
fstests expects this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
c83311c5b9 bcachefs: Use generic_set_sb_d_ops for standard casefolding d_ops
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
a2f546330e bcachefs: Fix losing return code in next_fiemap_extent()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
d1b0f9aa73 bcachefs: Rework fiemap transaction restart handling
Restart handling in the previous patch was incorrect, so: move btree
operations into a separate helper, and run it with a lockrestart_do().

Additionally, clarify whether pagecache or the btree takes precedence.

Right now, the btree takes precedence: this is incorrect, but it's
needed to pass fstests. Add a giant comment explaining why.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:10:29 -04:00
Brian Foster
b9b0494017 bcachefs: add fiemap delalloc extent detection
bcachefs currently populates fiemap data from the extents btree.
This works correctly when the fiemap sync flag is provided, but if
not, it skips all delalloc extents that have not yet been flushed.
This is because delalloc extents from buffered writes are first
stored as reservation in the pagecache, and only become resident in
the extents btree after writeback completes.

Update the fiemap implementation to process holes between extents by
scanning pagecache for data, via seek data/hole. If a valid data
range is found over a hole in the extent btree, fake up an extent
key and flag the extent as delalloc for reporting to userspace.

Note that this does not necessarily change behavior for the case
where there is dirty pagecache over already written extents, where
when in COW mode, writeback will allocate new blocks for the
underlying ranges. The existing behavior is consistent with btrfs
and it is recommended to use the sync flag for the most up to date
extent state from fiemap.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:29 -04:00
Brian Foster
2d55a63709 bcachefs: refactor fiemap processing into extent helper and struct
The bulk of the loop in bch2_fiemap() involves processing the
current extent key from the iter, including following indirections
and trimming the extent size and such. This patch makes a few
changes to reduce the size of the loop and facilitate future changes
to support delalloc extents.

Define a new bch_fiemap_extent structure to wrap the bkey buffer
that holds the extent key to report to userspace along with
associated fiemap flags. Update bch2_fill_extent() to take the
bch_fiemap_extent as a param instead of the individual fields.
Finally, lift the bulk of the extent processing into a
bch2_fiemap_extent() helper that takes the current key and formats
the bch_fiemap_extent appropriately for the fill function.

No functional changes intended by this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:29 -04:00
Brian Foster
d020a9fb11 bcachefs: track current fiemap offset in start variable
Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:28 -04:00
Brian Foster
28d2d19ccc bcachefs: drop duplicate fiemap sync flag
FIEMAP_FLAG_SYNC handling was deliberately moved into core code in
commit 45dd052e67 ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"),
released in kernel v5.8. Update bcachefs accordingly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:28 -04:00
Kent Overstreet
353739f1d1 bcachefs: Fix btree_iter_peek_prev() at end of inode
At the end of the inode, on an extents iterator, peek_slot() has to
advance to the next position to avoid returning a 0 size extent, which
is not allowed.

Changing iter->pos confuses peek_prev(), but we don't need to call
peek_slot() in this case.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
c4f89a1d35 bcachefs: Make btree_iter_peek_prev() assert more precise
The issue this assert is guarding against is that in
BTREE_ITER_filter_snapshots mode we only want to be iterating within a
single inode number - if we iterate into another inode number with keys
for a different snapshot tree, we'll loop arbitrarily long before
finding a key we can return.

This comes up in the unit tests, where we're using inode 0 for our test
keys.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
394ef278e1 bcachefs: Unit test fixes
The peek_end() tests expect an empty btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
caab547686 bcachefs: Print mount opts earlier
If we aren't mounting with the correct degraded option, it's helpful to
know that before we fail to mount degraded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
7cb85324c4 bcachefs: unlink: casefold d_invalidate
casefolding results in additional aliases on lookup for the
non-casefolded names - these need invalidating on unlink.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
9cdde3c7aa bcachefs: Fix casefold lookups
Add casefolding to bch2_lookup_trans:

During the delay between when casefolding was written and when it was
merged, the main filesystem lookup path grew self healing - which meant
it was no longer using bch2_dirent_lookup_trans(), where casefolding on
lookups happens.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
b9e1f873d2 bcachefs: Casefold is now a regular opts.h option
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:00 -04:00
Kent Overstreet
7a4a86618e bcachefs: Implement fileattr_(get|set)
inode_operations.fileattr_(get|set) didn't exist when the various flag
ioctls where implemented - but they do now, which means we can delete a
bunch of ioctl code in favor of standard VFS level wrappers.

Closes: https://lore.kernel.org/linux-bcachefs/7ltgrgqgfummyrlvw7hnfhnu42rfiamoq3lpcvrjnlyytldmzp@yazbhusnztqn/
Cc: Petr Vorel <pvorel@suse.cz>
Cc: Andrea Cervesato <andrea.cervesato@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 19:50:56 -04:00
Kent Overstreet
4ede80a9a8 bcachefs: Allocator now copes with unaligned buckets
We had a buggy release of bcachefs-tools that wasn't properly aligning
bucket sizes.

We can't ask users to reformat - and it's easy to teach the allocator to
make sure writes are properly aligned.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 19:36:45 -04:00
Kent Overstreet
387df33129 bcachefs: Start copygc, rebalance threads earlier
Previously, copygc and rebalance weren't started until the very end of
mounting, after all recvoery passes have finished.

But copygc really should be started earlier, since it may be needed for
allocations to make forward progress. Additionally, we've been seeing
occasional bug reports where starting the kthread fails due to a pending
signal - i.e. we're getting timed out by systemd (during a version
upgrade), but we're not seeing the signal until mount is about to
complete.

Additionally, we now have copygc/rebalance explicitly wait for
check_snapshots to complete (if being run); they require that for
snapshot_is_ancestor() in the data move path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 11:57:24 -04:00
Kent Overstreet
d64e8e842b bcachefs: Refactor bch2_run_recovery_passes()
Don't use a continue; this simplifies the next patch where
run_recovery_passes() will be responsible for waking up copygc and
rebalance at the appropriate time.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 11:56:43 -04:00
Kent Overstreet
10e42b6f25 bcachefs: bch2_copygc_wakeup()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 20:01:48 -04:00
Kent Overstreet
bfbb76ec98 bcachefs: Fix ref leak in write_super()
found with the new enumerated_ref code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00