Commit Graph

1294593 Commits

Author SHA1 Message Date
Xiaxi Shen
d89b35d83e bcachefs: Fix a spelling error in docs
Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
c7652f253a bcachefs: promote_whole_extents is now a normal option
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
cfd273f1ae bcachefs: Move rebalance_status out of sysfs/internal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Julian Sun
26c0900d85 bcachefs: remove the unused parameter in macro bkey_crc_next
In the macro definition of bkey_crc_next, five parameters
were accepted, but only four of them were used. Let's remove
the unused one.

The patch has only passed compilation tests, but it should be fine.

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Julian Sun
4d05a083b3 bcachefs: fix macro definition allocate_dropping_locks
The macro allocate_dropping_locks accepts a parameter _trans,
but it was not used, rather the variable trans was directly used,
which may be a local variable inside a function that calls the macros.

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Julian Sun
ba8c52e2b1 bcachefs: fix macro definition allocate_dropping_locks_errcode
The macro allocate_dropping_locks_errocode accepts a parameter _trans,
but it was not used, rather the variable trans was directly used,
which may be a local variable inside a function that calls the macros.

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Julian Sun
23fcd5f40a bcachefs: remove the unused macro definition
macro bch2_kthread_wait_event_ioclock_timeout is no longer used,
let's remove it.

The patch has passed compilation test.

Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
668c955155 bcachefs: quota_reserve_range() -> for_each_btree_key_in_subvolume_upto
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
093dd55d19 bcachefs: bch2_folio_set() -> for_each_btree_key_in_subvolume_upto
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
c95285d17e bcachefs: range_has_data() -> for_each_btree_key_in_subvolume_upto
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
330405057f bcachefs: bch2_seek_hole() -> for_each_btree_key_in_subvolume_upto
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
9f9e7f50af bcachefs: bch2_seek_data() -> for_each_btree_key_in_subvolume_upto
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
3da106cd1b bcachefs: bch2_xattr_list() -> for_each_btree_key_in_subvolume_upto
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
efdb77a25b bcachefs: bch2_readdir() -> for_each_btree_key_in_subvolume_upto
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
0215b91804 bcachefs: for_each_btree_key_in_subvolume_upto()
New helper for looping over keys in a given subvolume

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
1a3158ece5 bcachefs: bch2_fiemap(): call trans_begin() on every loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
7e7595723c bcachefs: bchfs_read(): call trans_begin() on every loop iter
Same as the recent change for __bch2_read(); also, kill now unnecessary
btree_trans_too_many_iters() calls.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
804baca745 bcachefs: kill bch2_btree_iter_peek_and_restart()
dead code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
32ed4a620c bcachefs: Btree path tracepoints
Fastpath tracepoints, rarely needed, only enabled with
CONFIG_BCACHEFS_PATH_TRACEPOINTS.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
abbfc4db50 bcachefs: Add check for btree_path ref overflow
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Youling Tang
094c6a9f5c bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
After commit 230e9fc286 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9 ("kmemcg:
account for certain kmem allocations to memcg")

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Youling Tang
082330c361 bcachefs: allocate inode by using alloc_inode_sb()
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() to alloc_inode_sb().

It will also fix [1] to avoid the NULL pointer dereference BUG in
list_lru_add() when CONFIG_MEMCG is enabled.

Links:
[1]: https://lore.kernel.org/all/20589721-46c0-4344-b2ef-6ab48bbe2ea5@linux.dev/
[2]: https://lore.kernel.org/all/7db60e36-9c96-4938-a28d-a9745e287386@linux.dev/

Fixes: 86d81ec5f5 ("bcachefs: Mark bch_inode_info as SLAB_ACCOUNT")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
9092a38a3d bcachefs: Opt_durability can now be set via bch2_opt_set_sb()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
4aedeac570 bcachefs: bch2_opt_set_sb() can now set (some) device options
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
afefc986b7 bcachefs: data_allowed is now an opts.h option
need this so cmd_option in userspace can handle it

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Thorsten Blum
8573dd3474 bcachefs: Annotate struct bucket_array with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
bucket to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Nathan Chancellor
5396e5af3c bcachefs: Fix format specifier in bch2_btree_key_cache_to_text()
When building for a 32-bit architecture, for which 'size_t' is
'unsigned int', there is a compiler warning due to use of '%lu':

  In file included from fs/bcachefs/vstructs.h:5,
                   from fs/bcachefs/bcachefs_format.h:80,
                   from fs/bcachefs/bcachefs.h:207,
                   from fs/bcachefs/btree_key_cache.c:3:
  fs/bcachefs/btree_key_cache.c: In function 'bch2_btree_key_cache_to_text':
  fs/bcachefs/btree_key_cache.c:795:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
    795 |         prt_printf(out, "pending:\t%lu\r\n",            per_cpu_sum(bc->nr_pending));
        |                         ^~~~~~~~~~~~~~~~~~~
  fs/bcachefs/util.h:78:63: note: in definition of macro 'prt_printf'
     78 | #define prt_printf(_out, ...)           bch2_prt_printf(_out, __VA_ARGS__)
        |                                                               ^~~~~~~~~~~
  fs/bcachefs/btree_key_cache.c:795:38: note: format string is defined here
    795 |         prt_printf(out, "pending:\t%lu\r\n",            per_cpu_sum(bc->nr_pending));
        |                                    ~~^
        |                                      |
        |                                      long unsigned int
        |                                    %u
  cc1: all warnings being treated as errors

Use the proper specifier, '%zu', to resolve the warning.

Fixes: e447e49977b8 ("bcachefs: key cache can now allocate from pending")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
5f1929f1f0 bcachefs: key cache can now allocate from pending
btree_trans objects can hold the btree_trans_barrier srcu read lock for
an extended amount of time (they shouldn't, but it's difficult to
guarantee).

the srcu barrier blocks memory reclaim, so to avoid too many stranded
key cache items, this uses the new pending_rcu_items to allocate from
pending items - like we did before, but now without a global lock on the
key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
f2bfe7e837 bcachefs: Rip out freelists from btree key cache
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
d2ed0f206a bcachefs: rcu_pending now works in userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
8e973a4f3c bcachefs: rcu_pending
Generic data structure for explicitly tracking pending RCU items,
allowing items to be dequeued (i.e. allocate from items pending
freeing). Works with conventional RCU and SRCU, and possibly other RCU
flavors in the future, meaning this can serve as a more generic
replacement for SLAB_TYPESAFE_BY_RCU.

Pending items are tracked in radix trees; if memory allocation fails, we
fall back to linked lists.

A rcu_pending is initialized with a callback, which is invoked when
pending items's grace periods have expired. Two types of callback
processing are handled specially:

- RCU_PENDING_KVFREE_FN

  New backend for kvfree_rcu(). Slightly faster, and eliminates the
  synchronize_rcu() slowpath in kvfree_rcu_mightsleep() - instead, an
  rcu_head is allocated if we don't have one and can't use the radix
  tree

  TODO:
  - add a shrinker (as in the existing kvfree_rcu implementation) so that
    memory reclaim can free expired objects if callback processing isn't
    keeping up, and to expedite a grace period if we're under memory
    pressure and too much memory is stranded by RCU

  - add a counter for amount of memory pending

- RCU_PENDING_CALL_RCU_FN

  Accelerated backend for call_rcu() - pending callbacks are tracked in
  a radix tree to eliminate linked list overhead.

to serve as replacement backends for kvfree_rcu() and call_rcu(); these
may be of interest to other uses (e.g. SLAB_TYPESAFE_BY_RCU users).

Note:

Internally, we're using a single rearming call_rcu() callback for
notifications from the core RCU subsystem for notifications when objects
are ready to be processed.

Ideally we would be getting a callback every time a grace period
completes for which we have objects, but that would require multiple
rcu_heads in flight, and since the number of gp sequence numbers with
uncompleted callbacks is not bounded, we can't do that yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
b3f9da79e7 lib/generic-radix-tree.c: add preallocation
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
f659463381 lib/generic-radix-tree.c: genradix_ptr_inlined()
Provide an inlined fast path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
54f7702466 bcachefs: Fix deadlock in __wait_on_freeing_inode()
We can't call __wait_on_freeing_inode() with btree locks held; we're
waiting on another thread that's in evict(), and before it clears that
bit it needs to write that inode to flush timestamps - deadlock.

Fixing this involves a fair amount of re-jiggering to plumb a new
transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
112d21fd1a bcachefs: switch to rhashtable for vfs inodes hash
the standard vfs inode hash table suffers from painful lock contention -
this is long overdue

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
88d2ae0e6e inode: make __iget() a static inline
bcachefs is switching to an rhashtable for vfs inodes instead of the
standard inode.c hashtable, so we need this exported, or - a static
inline makes more sense for a single atomic_inc().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Reed Riley
27663d7784 bcachefs: Replace div_u64 with div64_u64 where second param is u64
Bcachefs often uses this function to divide by nanosecond times - which
can easily cause problems when cast to u32.  For example, `cat
/sys/fs/bcachefs/*/internal/rebalance_status` would return invalid data
in the `duration waited` field because dividing by the number of
nanoseconds in a minute requires the divisor parameter to be u64.

Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Feiko Nanninga
36f0af4f44 bcachefs: Fix sysfs rebalance duration waited formatting
cat /sys/fs/bcachefs/*/internal/rebalance_status
waiting
  io wait duration:  13.5 GiB
  io wait remaining: 627 MiB
  duration waited:   1392 m

duration waited was increasing at a rate of about 14 times the expected
rate.

div_u64 takes a u32 divisor, but u->nsecs (from time_units[]) can be
bigger than u32.

Signed-off-by: Feiko Nanninga <feiko.nanninga@fnanninga.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Alyssa Ross
a3ed1cc413 bcachefs: Fix negative timespecs
This fixes two problems in the handling of negative times:

 • rem is signed, but the rem * c->sb.nsec_per_time_unit operation
   produced a bogus unsigned result, because s32 * u32 = u32.

 • The timespec was not normalized (it could contain more than a
   billion nanoseconds).

For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).

Cc: stable@vger.kernel.org
Fixes: 595c1e9bab ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
16005147cc bcachefs: Don't delete open files in online fsck
If a file is unlinked but still open, we don't want online fsck to
delete it - or fun inconsistencies will happen.

https://github.com/koverstreet/bcachefs/issues/727

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
2c377d8a71 bcachefs: fix btree_key_cache sysfs knob
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
52df04f039 bcachefs: More BCH_SB_MEMBER_INVALID support
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
df88febc20 bcachefs: Simplify bch2_bkey_drop_ptrs()
bch2_bkey_drop_ptrs() had a some complicated machinery for avoiding
O(n^2) when dropping multiple pointers - but when n is only going to be
~4, it's not worth it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
ec36573dcd bcachefs: Add a cond_resched() to __journal_keys_sort()
Without this, we'd potentially sort multiple times without a
cond_resched(), leading to hung task warnings on larger systems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
5a6e43af1e bcachefs: Fix ca->io_ref usage
ca->io_ref does not protect against the filesystem going way,
c->write_ref does. Much like

0b50b7313e bcachefs: Fix refcounting in discard path

the other async paths need fixing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
53f6619554 bcachefs: BCH_SB_MEMBER_INVALID
Create a sentinal value for "invalid device".

This is needed for removing devices that have stripes on them (force
removing, without evacuating); we need a sentinal value for the stripe
pointers to the device being removed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-03 20:43:14 -04:00
Kent Overstreet
7f12a963b6 bcachefs: fix rebalance accounting
Fixes: 49aa783039 ("bcachefs: Fix rebalance_work accounting")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-01 15:54:40 -04:00
Kent Overstreet
3d3020c461 bcachefs: Mark more errors as autofix
errors that are known to always be safe to fix should be autofix: this
should be most errors even at this point, but that will need some
thorough review.

note that errors are still logged in the superblock, so we'll still know
that they happened.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-31 19:27:01 -04:00
Kent Overstreet
e3e6940940 bcachefs: Revert lockless buffered IO path
We had a report of data corruption on nixos when building installer
images.

https://github.com/NixOS/nixpkgs/pull/321055#issuecomment-2184131334

It seems that writes are being dropped, but only when issued by QEMU,
and possibly only in snapshot mode. It's undetermined if it's write
calls are being dropped or dirty folios.

Further testing, via minimizing the original patch to just the change
that skips the inode lock on non appends/truncates, reveals that it
really is just not taking the inode lock that causes the corruption: it
has nothing to do with the other logic changes for preserving write
atomicity in corner cases.

It's also kernel config dependent: it doesn't reproduce with the minimal
kernel config that ktest uses, but it does reproduce with nixos's distro
config. Bisection the kernel config initially pointer the finger at page
migration or compaction, but it appears that was erroneous; we haven't
yet determined what kernel config option actually triggers it.

Sadly it appears this will have to be reverted since we're getting too
close to release and my plate is full, but we'd _really_ like to fully
debug it.

My suspicion is that this patch is exposing a preexisting bug - the
inode lock actually covers very little in IO paths, and we have a
different lock (the pagecache add lock) that guards against races with
truncate here.

Fixes: 7e64c86cdc ("bcachefs: Buffered write path now can avoid the inode lock")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-31 19:26:08 -04:00
Kent Overstreet
d26935690c bcachefs: Fix bch2_extents_match() false positive
This was caught as a very rare nonce inconsistency, on systems with
encryption and replication (and tiering, or some form of rebalance
operation running):

[Wed Jul 17 13:30:03 2024] about to insert invalid key in data update path
[Wed Jul 17 13:30:03 2024] old: u64s 10 type extent 671283510:6392:U32_MAX len 16 ver 106595503: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:104 gen 7 ptr: 4:513244:48 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] k:   u64s 10 type extent 671283510:6400:U32_MAX len 16 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 ptr: 4:513244:56 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] new: u64s 14 type extent 671283510:6392:U32_MAX len 8 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 cached ptr: 4:513244:56 gen 6 cached rebalance: target hdd compression zstd crc: c_size 8 size 16 offset 8 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 1:10860085:32 gen 0 ptr: 0:17285918:408 gen 0
[Wed Jul 17 13:30:03 2024] bcachefs (cca5bc65-fe77-409d-a9fa-465a6e7f4eae): fatal error - emergency read only

bch2_extents_match() was reporting true for extents that did not
actually point to the same data.

bch2_extent_match() iterates over pairs of pointers, looking for
pointers that point to the same location on disk (with matching
generation numbers). However one or both extents may have been trimmed
(or merged) and they might not have the same disk offset: it corrects
for this by subtracting the key offset and the checksum entry offset.

However, this failed when an extent was immediately partially
overwritten, and the new overwrite was allocated the next adjacent disk
space.

Normally, with compression off, this would never cause a bug, since the
new extent would have to be immediately after the old extent for the
pointer offsets to match, and the rebalance index update path is not
looking for an extent outside the range of the extent it moved.

However with compression enabled, extents take up less space on disk
than they do in the btree index space - and spuriously matching after
partial overwrite is possible.

To fix this, add a secondary check, that strictly checks that the
regions pointed to on disk overlap.

https://github.com/koverstreet/bcachefs/issues/717

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-26 20:33:12 -04:00