Commit Graph

1367581 Commits

Author SHA1 Message Date
Kent Overstreet
94426e4201 bcachefs: opts.casefold_disabled
Add an option for completely disabling casefolding on a filesystem, as a
workaround for overlayfs.

This should only be needed as a temporary workaround, until the
overlayfs fix arrives.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-01 19:33:46 -04:00
Kent Overstreet
c6e8d51b37 bcachefs: Work around deadlock to btree node rewrites in journal replay
Don't mark btree nodes for rewrites, if they are or would be degraded,
if journal replay hasn't finished, to avoid a deadlock.

This is because btree node rewrites generate more updates for the
interior updates (alloc, backpointers), and if those updates touch
new nodes and generate more rewrites - we can only have so many interior
btree updates in flight before we deadlock on open_buckets.

The biggest cause is that we don't use the btree write buffer for the
backpointer updates - this needs some real thought on locking in order
to fix.

The problem with this workaround (not doing the rewrite for degraded
nodes in journal replay) is that those degraded nodes persist, and we
don't want that (this is a real bug when a btree node write completes
with fewer replicas than we wanted and leaves a degraded node due to
device _removal_, i.e. the device went away mid write).

It's less of a bug here, but still a problem because we don't yet
have a way of tracking degraded data - we need another index (all
extents/btree nodes, by replicas entry) in order to fix properly
(re-replicate degraded data at the earliest possible time).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-01 19:33:46 -04:00
Alan Huang
fbf913cb72 bcachefs: Fix incorrect transaction restart handling
Reported-by: syzbot+cc7567f096079cb4146f@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-30 17:28:55 -04:00
Kent Overstreet
14da58521e bcachefs: fix btree_trans_peek_prev_journal()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-29 00:47:52 -04:00
Bharadwaj Raju
96de8f8520 bcachefs: mark invalid_btree_id autofix
Checking for invalid IDs was introduced in 9e7cfb35e2 ("bcachefs: Check for invalid btree IDs")
to prevent an invalid shift later, but since 1415265480 ("bcachefs: Bad btree roots are now autofix")
which made btree_root_bkey_invalid autofix, the fsck_err_on call didn't
do anything.

We can mark this err type (invalid_btree_id) autofix as well, so it gets
handled.

Reported-by: syzbot+029d1989099aa5ae3e89@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=029d1989099aa5ae3e89
Fixes: 1415265480 ("bcachefs: Bad btree roots are now autofix")

Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-27 12:47:07 -04:00
Kent Overstreet
ef6fac0f9e bcachefs: Plumb correct ip to trans_relock_fail tracepoint
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
Kent Overstreet
64b6a788bd bcachefs: Ensure we rewind to run recovery passes
Fix a 6.16 regression from the recovery pass rework, which introduced a
bug where calling bch2_run_explicit_recovery_pass() would only return
the error code to rewind recovery for the first call that scheduled that
recovery pass.

If the error code from the first call was swallowed (because it was
called by an asynchronous codepath), subsequent calls would go "ok, this
pass is already marked as needing to run" and return 0.

Fixing this ensures that check_topology bails out to run btree_node_scan
before doing any repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
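The behaviour that 64b6a788bd restores can be illustrated with a toy model. This is a minimal sketch and not the bcachefs code: `struct toy_recovery`, `toy_run_explicit_pass()` and `TOY_ERR_REWIND` are hypothetical names, and it assumes that "rewind" means the requested pass is earlier than the one currently running. The point is only that the rewind error is returned on every call, not just on the call that first schedules the pass.

```c
#include <stdint.h>

/* Hypothetical stand-in for the "rewind recovery" error code. */
#define TOY_ERR_REWIND (-1)

struct toy_recovery {
	unsigned	cur_pass;	/* pass currently running */
	uint64_t	passes_pending;	/* bitmask of passes scheduled to run */
};

/*
 * Schedule @pass. Returns TOY_ERR_REWIND whenever the caller must unwind
 * so that an earlier pass can run, even if @pass was already scheduled.
 */
static int toy_run_explicit_pass(struct toy_recovery *r, unsigned pass)
{
	r->passes_pending |= UINT64_C(1) << pass;

	/*
	 * The buggy variant in this model would return 0 here when the bit
	 * was already set, so a swallowed first error was never seen again.
	 */
	if (pass < r->cur_pass)
		return TOY_ERR_REWIND;
	return 0;
}
```

In this model a second caller asking for the same earlier pass still gets `TOY_ERR_REWIND`, so it bails out instead of carrying on with repair.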
Kent Overstreet
3e72acb78b bcachefs: Ensure btree node scan runs before checking for scanned nodes
Previously, calling bch2_btree_has_scanned_nodes() when btree node
scan hadn't actually run would erroneously return false - causing us to
think a btree was entirely gone.

This fixes a 6.16 regression from moving the scheduling of btree node
scan out of bch2_btree_lost_data() (fixing the bug where we'd schedule
it persistently in the superblock) and only scheduling it when
check_topology() is asking for scanned btree nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
Kent Overstreet
1dcea07810 bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix
Autofix is specified in btree_gc.c if it's not an important btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
Kent Overstreet
1f8aede70d bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 18:58:18 -04:00
Kent Overstreet
f5109c201c bcachefs: Use wait_on_allocator() when allocating journal
wait_on_allocator() emits debug info when we hang trying to allocate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 18:16:01 -04:00
Kent Overstreet
865ad1dbf1 bcachefs: Check for bad write buffer key when moving from journal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 15:48:00 -04:00
Alan Huang
5c4acbc8ce bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked
Reported-by: syzbot+d540192e763531d307ff@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 15:46:59 -04:00
Kent Overstreet
72c0d9cb0f bcachefs: Fix range in bch2_lookup_indirect_extent() error path
Before calling bch2_indirect_extent_missing_error(), we have to
calculate the missing range, which is the intersection of the reflink
pointer and the non-indirect-extent we found.

The calculation didn't take into account that the returned extent may
span the iter position, leading to an infinite loop when we
(unnecessarily) resized the extent we were returning to one that didn't
extend past the offset we were looking up.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-22 00:29:03 -04:00
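The "missing range" described in 72c0d9cb0f is an interval intersection. The following is a minimal sketch of that idea only; `struct toy_range` and `toy_range_intersect()` are hypothetical and do not reflect how bcachefs represents keys. Ranges are assumed to be half-open, [start, end). Clamping both ends handles the case where the found key begins before the position being looked up.

```c
#include <stdint.h>

/* Half-open range [start, end); a hypothetical type for illustration. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

static uint64_t toy_max(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t toy_min(uint64_t a, uint64_t b) { return a < b ? a : b; }

/*
 * Intersection of two ranges. Returns an empty range (start == end)
 * when they do not overlap.
 */
static struct toy_range toy_range_intersect(struct toy_range a,
					    struct toy_range b)
{
	struct toy_range r = {
		.start	= toy_max(a.start, b.start),
		.end	= toy_min(a.end, b.end),
	};

	if (r.end < r.start)
		r.end = r.start;
	return r;
}
```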
Kent Overstreet
abcb6bd4be bcachefs: fix spurious error_throw
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-22 00:29:03 -04:00
Kent Overstreet
bb378314ce bcachefs: Add missing bch2_err_class() to fileattr_set()
Make sure we return a standard error code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-22 00:29:03 -04:00
Kent Overstreet
b2e2bed119 bcachefs: Add missing key type checks to check_snapshot_exists()
For now we only have one key type in these btrees, but forward
compatibility means we do have to check.

Reported-by: syzbot+b4cb4a6988aced0cec4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-19 14:37:04 -04:00
Kent Overstreet
32a01cd433 bcachefs: Don't log fsck err in the journal if doing repair elsewhere
This fixes exceeding the bump allocator limit when the allocator finds
many buckets that need repair - they're repaired asynchronously, which
means that every error logged a message in the bump allocator, without
committing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-19 13:08:07 -04:00
Kent Overstreet
b2348fe6c8 bcachefs: Fix *__bch2_trans_subbuf_alloc() error path
Don't change buf->size on error - this would usually be a transaction
restart, but it could also be -ENOMEM (when we've exceeded the bump
allocator max).

Fixes: 247abee6ae ("bcachefs: btree_trans_subbuf")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-19 13:08:06 -04:00
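The rule in b2348fe6c8 (leave the recorded size alone when growing fails) is a general pattern. Here is a minimal userspace sketch of it, assuming a plain `realloc()`-backed buffer; `struct toy_subbuf` and `toy_subbuf_grow()` are hypothetical and are not the real btree_trans_subbuf code.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical growable buffer; not the real btree_trans_subbuf layout. */
struct toy_subbuf {
	void	*data;
	size_t	size;
};

/*
 * Grow @buf to at least @new_size bytes, refusing to exceed @max.
 * On failure @buf is left exactly as it was, so buf->size never
 * claims more space than buf->data actually has.
 */
static int toy_subbuf_grow(struct toy_subbuf *buf, size_t new_size, size_t max)
{
	void *p;

	if (new_size <= buf->size)
		return 0;
	if (new_size > max)
		return -ENOMEM;

	p = realloc(buf->data, new_size);
	if (!p)
		return -ENOMEM;

	/* Commit the new size only after the allocation has succeeded. */
	buf->data = p;
	buf->size = new_size;
	return 0;
}
```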
Kent Overstreet
434635987f bcachefs: Fix missing newlines before ero
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:27 -04:00
Kent Overstreet
88bd771191 bcachefs: fix spurious error in read_btree_roots()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:26 -04:00
Kent Overstreet
1df310860a bcachefs: fsck: Fix oops in key_visible_in_snapshot()
The normal fsck code doesn't call key_visible_in_snapshot() with an
empty list of snapshot IDs seen (the current snapshot ID will always be
on the list), but str_hash_repair_key() ->
bch2_get_snapshot_overwrites() can, and that's totally fine as long as
we check for it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:26 -04:00
Kent Overstreet
3f890768da bcachefs: fsck: fix unhandled restart in topology repair
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:26 -04:00
Kent Overstreet
bbc3a0b17a bcachefs: fsck: Fix check_directory_structure when no check_dirents
check_directory_structure runs after check_dirents, so it expects that
it won't see any inodes with missing backpointers - normally.

But online fsck can't run check_dirents yet, or the user might only be
running a specific pass, so we need to be careful that this isn't an
error. If an inode is unreachable, that's handled by a separate pass.

Also, add a new 'bch2_inode_has_backpointer()' helper, since we were
doing this inconsistently.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 13:35:19 -04:00
Kent Overstreet
e1f0e1a45a bcachefs: Fix restart handling in btree_node_scrub_work()
btree node scrub was sometimes failing to rewrite nodes with errors;
bch2_btree_node_rewrite() can return a transaction restart and we
weren't checking - the lockrestart_do() needs to wrap the entire
operation.

And there's a better helper it should've been using,
bch2_btree_node_rewrite_key(), which makes all this more convenient.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 11:42:06 -04:00
Kent Overstreet
6c4897caef bcachefs: Fix bch2_read_bio_to_text()
We can only pass negative error codes to bch2_err_str(); if it's a
positive integer it's not an error and we trip an assert.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 20:35:42 -04:00
Kent Overstreet
495ba899d5 bcachefs: fsck: Fix check_path_loop() + snapshots
A path exists in a particular snapshot: we should do the pathwalk in the
snapshot ID of the inode we started from, _not_ change snapshot ID as we
walk inodes and dirents.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:05:02 -04:00
Kent Overstreet
583ba52a40 bcachefs: fsck: check_subdir_count logs path
We can easily go from inode number -> path now, which makes for more
useful log messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:05:02 -04:00
Kent Overstreet
8d6ac82361 bcachefs: fsck: additional diagnostics for reattach_inode()
Log the inode's new path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:05:02 -04:00
Kent Overstreet
3e5ceaa5bf bcachefs: fsck: check_directory_structure runs in reverse order
When we find a directory connectivity problem, we should do the repair
in the oldest snapshot that has the issue - so that we don't end up
duplicating work or making a real mess of things.

Oldest snapshot IDs have the highest integer value, so - just walk
inodes in reverse order.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:05:02 -04:00
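The ordering argument in 3e5ceaa5bf can be shown with a small sketch. It assumes, as the commit message states, that older snapshots have higher IDs; `struct toy_inode` and `toy_oldest_broken()` are hypothetical and only model the iteration order, not the fsck pass itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one version of an inode in one snapshot. */
struct toy_inode {
	uint32_t	snapshot;	/* higher ID == older snapshot */
	bool		broken;		/* has a connectivity problem */
};

/*
 * @v is sorted by ascending snapshot ID, as a forward walk would see it.
 * Walk it backwards so the oldest snapshot is visited first, and return
 * the index of the first broken entry found, or -1 if there is none.
 */
static int toy_oldest_broken(const struct toy_inode *v, size_t nr)
{
	for (size_t i = nr; i-- > 0;)
		if (v[i].broken)
			return (int) i;
	return -1;
}
```

Repairing at the returned index first means newer snapshots, which sit at lower indices, are handled only after the oldest affected one.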
Kent Overstreet
9fb09ace59 bcachefs: fsck: Fix reattach_inode() for subvol roots
bch_subvolume.fs_path_parent needs to be updated as well, it should
match inode.bi_parent_subvol.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:04:59 -04:00
Kent Overstreet
c1ca07a4dd bcachefs: fsck: Fix remove_backpointer() for subvol roots
The dirent will be in a different snapshot if the inode is a subvolume
root.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:04:54 -04:00
Kent Overstreet
7029cc4d13 bcachefs: fsck: Print path when we find a subvol loop
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:04:48 -04:00
Kent Overstreet
9ba6930ef8 bcachefs: Fix __bch2_inum_to_path() when crossing subvol boundaries
The bch2_subvolume_get_snapshot() call needs to happen before the dirent
lookup - the dirent is in the parent subvolume.

Also, check for loops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:04:48 -04:00
Kent Overstreet
1cddad0fcb bcachefs: Call bch2_fs_init_rw() early if we'll be going rw
kthread creation checks for pending signals, which is _very_ annoying if
we have to do a long recovery and don't go rw until we've done
significant work.

Check if we'll be going rw and pre-allocate kthreads/workqueues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:04:48 -04:00
Kent Overstreet
f2a701fd94 bcachefs: fsck: Improve check_key_has_inode()
Print out more info when we find a key (extent, dirent, xattr) for a
missing inode - whether there was a good inode in an older snapshot, and
a full(ish) list of keys for that missing inode - so we can make better
decisions on how to repair.

If it looks like it should've been deleted, autofix it. If we ever hit
the non-autofix cases, we'll want to write more repair code (possibly
reconstituting the inode).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:04:44 -04:00
Bharadwaj Raju
03208bd06a bcachefs: don't return fsck_fix for unfixable node errors in __btree_err
After cd3cdb1ef7 ("Single err message for btree node reads"),
all errors caused __btree_err to return -BCH_ERR_fsck_fix no matter what
the actual error type was if the recovery pass was scanning for btree
nodes. This led to the code continuing despite things like bad node
formats when they earlier would have caused a jump to fsck_err, because
btree_err only jumps when the return from __btree_err does not match
fsck_fix. Ultimately this led to undefined behavior by attempting to
unpack a key based on an invalid format.

Make only errors of type -BCH_ERR_btree_node_read_err_fixable cause
__btree_err to return -BCH_ERR_fsck_fix when scanning for btree nodes.

Reported-by: syzbot+cfd994b9cdf00446fd54@syzkaller.appspotmail.com
Fixes: cd3cdb1ef7 ("bcachefs: Single err message for btree node reads")
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
Alan Huang
56be92c63f bcachefs: Fix pool->alloc NULL pointer dereference
btree_interior_update_pool is not initialized until the filesystem
becomes read-write, so mempool_alloc() in bch2_btree_update_start()
triggers a pool->alloc NULL pointer dereference in mempool_alloc_noprof().

Reported-by: syzbot+2f3859bd28f20fa682e6@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
Alan Huang
d89a34b14d bcachefs: Move bset size check before csum check
In syzbot's crash, the bset's u64s is larger than the btree node.

Reported-by: syzbot+bfaeaa8e26281970158d@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
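The reordering in d89a34b14d follows a general rule: validate a length field before using it to bound a checksum. The sketch below is a toy layout invented for illustration (word 0 is a stored checksum, word 1 is the payload length in u64s, the payload follows); `toy_csum()` and `toy_bset_validate()` are hypothetical and unrelated to the real bset format or checksum types.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy additive checksum over @len 64-bit words; a stand-in for a real one. */
static uint64_t toy_csum(const uint64_t *p, size_t len)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += p[i];
	return sum;
}

/*
 * Validate a toy "bset" stored in @node, which holds @node_u64s words.
 * Returns 0 if valid, -1 on a bad size, -2 on a checksum mismatch.
 */
static int toy_bset_validate(const uint64_t *node, size_t node_u64s)
{
	uint64_t u64s;

	if (node_u64s < 2)
		return -1;
	u64s = node[1];

	/* Size check first: otherwise the checksum would read past the node. */
	if (u64s > node_u64s - 2)
		return -1;

	if (toy_csum(node + 2, (size_t) u64s) != node[0])
		return -2;
	return 0;
}
```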
Kent Overstreet
7c9cef5f8b bcachefs: mark more errors autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
Kent Overstreet
10dfe4926d bcachefs: Kill unused tracepoints
Dead code cleanup.

Link: https://lore.kernel.org/linux-bcachefs/20250612224059.39fddd07@batman.local.home/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
Kent Overstreet
17c3395e25 bcachefs: opts.journal_rewind
Add a mount option for rewinding the journal, bringing the entire
filesystem to where it was at a previous point in time.

This is for extreme disaster recovery scenarios - it's not intended as
an undelete operation.

The option takes a journal sequence number; the desired sequence number
can be determined with 'bcachefs list_journal'.

Caveats:

- The 'journal_transaction_names' option must have been enabled (it's on
  by default). The option controls emitting of extra debug info in the
  journal, so we can see what individual transactions were doing.
  It also enables journalling of keys being overwritten, which is what
  we rely on here.

- A full fsck run will be automatically triggered since alloc info will
  be inconsistent. Only leaf node updates to non-alloc btrees are
  rewound, since rewinding interior btree updates isn't possible or
  desirable.

- We can't do anything about data that was deleted and overwritten.

  Lots of metadata updates after the point in time we're rewinding to
  shouldn't cause a problem, since we segregate data and metadata
  allocations (this is in order to make repair by btree node scan
  practical on larger filesystems; there's a small 64-bit per device
  bitmap in the superblock of device ranges with btree nodes, and we try
  to keep this small).

  However, having discards enabled will cause problems, since buckets
  are discarded as soon as they become empty (this is why we don't
  implement fstrim: we don't need it).

  Hopefully, this feature will be a one-off thing that's never used
  again: this was implemented for recovering from the "vfs i_nlink 0 ->
  subvol deletion" bug, and that bug was unusually disastrous and
  additional safeguards have since been implemented.

  But if it does turn out that we need this more in the future, I'll
  have to implement an option so that empty buckets aren't discarded
  immediately - lagging by perhaps 1% of device capacity.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
Kent Overstreet
191334400d bcachefs: fsck: fix extent past end of inode repair
Fix the case where we're deleting in a different snapshot and need to
emit a whiteout - that requires a regular BTREE_ITER_filter_snapshots
iterator.

Also, only delete the part of the extent that extends past i_size.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
b17d7bdb12 bcachefs: fsck: fix add_inode()
The inode btree uses the offset field for the inum, not the inode field.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
c27e5782d9 bcachefs: Fix snapshot_key_missing_inode_snapshot repair
When the inode was a whiteout, we were inserting a new whiteout at the
wrong (old) snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
c1ccd43b35 bcachefs: Fix "now allowing incompatible features" message
Check against version_incompat_allowed, not version_incompat.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
2ba562cc04 bcachefs: pass last_seq into fs_journal_start()
Prep work for journal rewind, where the seq we're replaying from may be
different than the last journal entry's last_seq.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
f2ed089273 bcachefs: better __bch2_snapshot_is_ancestor() assert
Previously, we weren't checking the result of the skiplist walk, just
the is_ancestor bitmap.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
425da82c63 bcachefs: btree_iter: fix updates, journal overlay
We need to start searching from search_key - _not_ path->pos, which will
point to the key we found in the btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
0e62fca2a6 bcachefs: Fix bch2_journal_keys_peek_prev_min()
This code is rarely invoked, so - we had a few bugs left from basing it
off of bch2_journal_keys_peek_max()...

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:55 -04:00