When we detect bad keys in the journal that have to be dropped, the flow
control was wrong - we ended up not checking the next key in that entry.
Oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Improved the way we track various state by adding j->err_seq, which
records the first journal sequence number that encountered an error
being written, and j->last_empty_seq, which records the most recent
journal entry that was completely empty.
Also, use the low bits of the journal sequence number to index the
corresponding journal_buf.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is used to print keys that failed bch2_bkey_invalid(), so be more
careful with k->type.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, the journal entry read code was changed so that if we got a
journal entry that failed validation, we'd try to use it, preferring to
use a good version from another device if available.
But this left a bug where if an earlier validation check (say, checksum)
failed, the later checks (for last_seq) wouldn't run and we'd end up
using a journal entry with a garbage last_seq field. This fixes that so
that the later validation checks run and if necessary change those
fields to something sensible.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the dio write path, when get_user_pages() invokes the fault handler
we have a recursive locking situation - we have to handle the lock
ordering ourselves or we have a deadlock: this patch addresses that by
checking for locking ordering violations and doing the unlock/relock
dance if necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_varint_decode can do reads up to 7 bytes past the end ptr, for the
sake of performance - these extra bytes are always masked off.
This won't be a problem in practice if we make sure to burn 8 bytes in
any buffer that has bkeys in it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On emergency shutdown, we might still have dirty keys in the btree key
cache that need to be cleaned up properly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us improve journal reclaim, so that it now tries to make sure
no more than 3/4s of the btree node cache and btree key cache are dirty
- ensuring the shrinkers can free memory.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The change to use the cpu nr for the high bits of new inode numbers
means that inode numbers are very space - we see -ENOMEM during fsck
without this.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previous varint implementation used by the inode code was not nearly as
fast as it could have been; partly because it was attempting to encode
integers up to 96 bits (for timestamps) but this meant that encoding and
decoding the length required a table lookup.
Instead, we'll just encode timestamps greater than 64 bits as two
separate varints; this will make decoding/encoding of inodes
significantly faster overall.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this function is only used by debug code, but we'd like to always build
it so we know that it does build.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This only did anything in two places, and those can just be replaced
wiht bkey_cmp_left_packed()).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Allocating our array of btree iters is a big enough allocation that it
hits the buddy allocator, and we're seeing lots of lock contention.
Sticking a single element buffer in front of it should help.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
PAGE_SIZE and size_t are not unsigned longs on 32 bit, annoying...
also switch to atomic64_cmpxchg instead of cmpxchg() for
journal_seq_copy, as atomic64_cmpxchg has a fallback that uses spinlocks
for when it's not supported.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This shards new inodes into different btree nodes by using the processor
ID for the high bits of the new inode number. Much faster than the
previous inode create optimization - this also helps with sharding in
the other btrees that index by inode number.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
perhaps a bit silly, but some debug assertions we want to add need const
propagated a bit more.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These haven't been in used since reallocing iterators has been disabled,
and saves us a lot of stack if we get rid of it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the btree key cache code, journal reclaim now has a lot more work
to do. It could be the case that after journal reclaim has finished one
iteration there's already more work to do, so put it in a loop to check
for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On workloads that do a lot of multithreaded creates all at once, lock
contention on the inodes btree turns out to still be an issue.
This patch adds a small buffer of inode numbers that are known to be
free, so that we can avoid touching the btree on every create. Also,
this changes inode creates to update via the btree key cache for the
initial create.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The check for whether locking a btree node would deadlock was wrong - we
have to check that interior nodes are locked before descendents, but
this check was wrong when consider cached vs. non cached iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have a bug where we can get stuck with a process spinning in
transaction restarts - need more information.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A bkey_on_stack_realloc() call was in the wrong place, and broken for
indirect extents
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches inode updates to use cached btree iterators - which should
be a nice performance boost, since lock contention on the inodes btree
can be a bottleneck on multithreaded workloads.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- fiemap didn't know about inline extents, fixed
- advancing to the next extent after we'd chased a pointer to the
reflink btree was wrong, fixed
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a bug where bch2_trans_update() would incorrectly delete a
pending update where the new update did not actually overwrite the
existing update, because we were incorrectly using BTREE_ITER_TYPE when
sorting pending btree updates.
This affects the pending patch to use cached iterators for inode
updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is to fix a (harmless) bug where the read clock hand in the
superblock doesn't match the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where we'd pop an assertion due to replaying a key for
an interior btree node when that node no longer exists.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>