Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts
requires a not_extents iterator; this fixes the path where we're walking
the extents btree and chase a reflink pointer into the reflink btree.
bch2_lookup_indirect_extent() requires working with an extents iterator
(due to peek_slot() semantics), so we implement
bch2_lookup_indirect_extent_for_move().
This is simplified because there's no need to report
indirect_extent_missing_errors here, that can be deferred until fsck or
when a user reads that data.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's various checks for "are we going to compress this" - but we're
not going to compress if we know it's incompressible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We weren't checking that accounting keys have the expected number of
accounters. Originally we probably wanted to be flexible on this, but it
doesn't look like that will be required - accounting is extended by
adding new counter types, not more counters to an existing type.
This means we can drop a BUG_ON() that popped once in automated testing,
and the new validation will make that bug easier to track down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, improve the message in prep_encoded_data() - it now prints
good/bad checksums, and checksum type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size
to "the size we can read from the current extent" with a swap, and
restores it to "the size for the total read" after the read_extent call
with another swap.
But we neglected to do the restore before the "if (ret) goto err;" -
which is a problem if we're retrying those errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're moving an extent that was partially overwritten,
bch2_write_rechecksum() will trim it to the currenty live range.
If we then also want to compress it, it'll be decrypted - but the nonce
has been advanced for the overwritten start of the extent that we
dropped, and we were using the nonce we calculated before rechecksum().
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Fixes: 127d90d282 ("bcachefs: bch2_write_prep_encoded_data() now returns errcode")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In debug mode, we save the call stack on transaction restart - but
there's no locking, so we can't touch it if we're issuing the restart
from another thread.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're hitting some issues with uninitialized struct padding, flagged by
kmsan.
They appear to be falso positives, otherwise bch2_accounting_validate()
would have flagged them as "junk at end". But for now, we'll need to
initialize disk_accounting_pos with memset().
This adds a new helper, bch2_disk_accounting_mod2(), that initializes a
disk_accounting_pos and does the accounting mod all at once - so overall
things actually get slightly more ergonomic.
BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning
about them and they're a bit special.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We appear to be tripping over a compiler/kmsan bug with padding fields -
this is an easy workaround.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We may sometimes read from uninitialized memory; we know, and that's ok.
We check if a btree node has been reused before waiting on any
outstanding IO.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We store to all fields, so the kmsan warnings were spurious - but
initializing via stores to bitfields appear to have been giving the
compiler/kmsan trouble, and they're not necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More on the "full online self healing" project:
We now run most of the dirent <-> inode consistency checks, with repair
code, at runtime - the exact same check and repair code that fsck runs.
This will allow us to repair the "dirent points to inode that does not
point back" inconsistencies that have been popping up at runtime.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for calling bch2_check_dirent_target() from bch2_lookup().
- Add an inline wrapper, if the target and backpointer match we can skip
the function call.
- We don't (yet?) want to remove the dirent we did the lookup from (when
we find a directory or subvol with multiple valid dirents pointing to
it), we can defer on that until later. For now, add an "are we in
fsck?" parameter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're gradually running more and more fsck.c checks at runtime,
whereever applicable; when we do so they get moved out of fsck.c.
Next patch will call bch2_check_dirent_target() from bch2_lookup().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Replace these with proper private error codes, so that when we get an
error message we're not sifting through the entire codebase to see where
it came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The smp_rmb() guarantees that reads from reservations.counter
occur before accessing cur_entry_u64s. It's paired with the
atomic64_try_cmpxchg in journal_entry_open.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Convert these to standard error codes, which means we can pass them
outside the journal code, they're easier to pass to tracepoints, etc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the discard option is special, because it's both a filesystem and a
device option.
When set at the filesytsem level, it's supposed to propagate to (if set
persistently via sysfs) or override (if non persistently as a mount
option) the devices - that now works correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Other options can normally be set at runtime via sysfs, no reason for
this one not to be as well - it just doesn't support the degraded flags
argument this way, that requires the ioctl.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Device options now use the common code for sysfs, and can superblock
fields (in a struct bch_member).
This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, device options had their superblock option field listed
separately, which was weird and easy to miss when defining options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is
the condition in journal_entry_open where we return JOURNAL_ERR_max_open,
so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf
corresponding to seq has started to write.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rebalance requires a not_extents iterator.
This wasn't hit before because all_snapshots disableds is_extents on
snapshots btrees - but has no effect on the reflink btree.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was missed - but it needs to be correct for the superblock recovery
tool that scans the start and end of the device for backup superblocks:
we don't want to pick up superblocks that belong to a different
partition that starts at a different offset.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Minor refactoring, so that bch2_sb_validate() can be used in the new
userspace superblock recovery tool.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we can't mount because of an incompatibility, print what's supported
and unsupported - to help solve PEBKAC issues.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Just use sha256() instead of the clunky crypto API. This is much
simpler.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since bcachefs does not access crc32c and crc64 through the crypto API,
there is no need to use module softdeps to ensure they are loaded.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes another "rebalance spinning and doing no work" issue;
rebalance was reading extents it wanted to move, but then failing in
bch2_write() -> bch2_alloc_sectors_start() due to being unable to
allocate sufficient replicas.
This was triggered by a user playing with the durability settings, the
foreground device was an NVME device with durability=2, and originally
he'd set the background device to durability=2 as well, but changed it
back to 1 (the default) after seeing IO errors.
That meant that with replicas=2, we want to move data off the NVME
device which satisfies that constraint, but with a single durability=1
device on the background target there's no way to move the extent to
that target while satisfiying the "required replicas" constraint.
The solution for now is for bch2_data_update_init() to check for this,
and return an error - before kicking off the read.
bch2_data_update_init() already had two different checks for "will we be
able to write this extent", with partially duplicated code, so this
patch combines and improves that logic.
Additionally, we now always bail out and return an error if there's
insufficient space on the destination target. Previously, we only did
this for BCH_WRITE_alloc_nowait moves, because it might be the case that
copygc just needs to free up space on the destination target.
But we really shouldn't kick off a move if the destination is full, we
can't currently distinguish between "really full" and "just need to wait
for copygc", and if we are going to wait on copygc it'd be better to do
that before kicking off the move.
This will additionally fix "rebalance spinning" issues caused by a
filesystem that has more data than can fit in background_target - which
is a valid scenario, since we don't exclude foreground/cache devices
when calculating filesystem capacity.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>