Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.
This patch also simplifies the iterator allocation code a bit.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This way, these tests can be used with tests that inject IO errors and
shut down the filesystem.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introducing the journal+btree iter introduced a regression where we
stopped using BTREE_ITER_PREFETCH - this is a performance regression on
rotating disks.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For the new nodes an interior btree update makes reachable, updates to
those nodes may be journalled after the btree update starts but before
the transactional part - where we make those nodes reachable. Those
updates need to be kept in the journal until after the btree update
completes, hence we should always get a journal pin at the start of the
interior update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the btree key cache code, failing to flush a dirty key is a serious
error, but it doesn't need to be a BUG_ON(), we can stop the filesystem
instead.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The rhashtable code doesn't like when we destroy an rhashtable that was
never initialized
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't run journal reclaim until we've finished replaying updates to
interior btree nodes - the check for this was in the wrong place though,
leading to journal reclaim spinning before it was allowed to proceed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were incorrectly ignoring the return value of __readahead_batch,
leading to a null ptr deref in __bch2_page_state_create().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck doesn't know about the btree key cache, and non-cached iterators
aren't cache coherent (yet?)
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed to ensure we don't deadlock because journal reclaim and
thus memory reclaim isn't making forward progress.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Memory reclaim requires journal reclaim to make forward progress - it's
what cleans our caches - thus, while we're in journal reclaim or holding
the journal reclaim lock we can't recurse into memory reclaim.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The transaction restart path traverses all iterators, we don't need to
do it here.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensuring the key cache isn't too dirty is critical for ensuring that the
shrinker can reclaim memory.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The shrinker should start scanning for entries that can be freed oldest
to newest - this way, we can avoid scanning a lot of entries that are
too new to be freed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We allocate a lot of these, and we're seeing sporading OOMs - this will
help with tracking those down.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were incorrectly detecting a journal deadlock - the journal filling
up - when only the journal pin fifo had filled up; if the journal pin
fifo is full that just means we need to wait on reclaim.
This plumbs through better error reporting so we can better discriminate
in the journal_res_get path what's going on.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we detect bad keys in the journal that have to be dropped, the flow
control was wrong - we ended up not checking the next key in that entry.
Oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Improved the way we track various state by adding j->err_seq, which
records the first journal sequence number that encountered an error
being written, and j->last_empty_seq, which records the most recent
journal entry that was completely empty.
Also, use the low bits of the journal sequence number to index the
corresponding journal_buf.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is used to print keys that failed bch2_bkey_invalid(), so be more
careful with k->type.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, the journal entry read code was changed so that if we got a
journal entry that failed validation, we'd try to use it, preferring to
use a good version from another device if available.
But this left a bug where if an earlier validation check (say, checksum)
failed, the later checks (for last_seq) wouldn't run and we'd end up
using a journal entry with a garbage last_seq field. This fixes that so
that the later validation checks run and if necessary change those
fields to something sensible.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the dio write path, when get_user_pages() invokes the fault handler
we have a recursive locking situation - we have to handle the lock
ordering ourselves or we have a deadlock: this patch addresses that by
checking for locking ordering violations and doing the unlock/relock
dance if necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_varint_decode can do reads up to 7 bytes past the end ptr, for the
sake of performance - these extra bytes are always masked off.
This won't be a problem in practice if we make sure to burn 8 bytes in
any buffer that has bkeys in it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On emergency shutdown, we might still have dirty keys in the btree key
cache that need to be cleaned up properly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>