linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-02-24 04:15:07 -05:00

Author	SHA1	Message	Date
Kent Overstreet	e5a3b8cf33	bcachefs: More informative error message when shutting down due to error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 22:42:17 -04:00
Kent Overstreet	417f01e726	bcachefs: Error ratelimiting is no longer only during fsck We now more often do repair automatically, without the user invoking fsck - and sometimes that can involve fixing lots of errors, so let's avoid flooding the dmesg log. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Bharadwaj Raju	570f5050bb	bcachefs: use nonblocking variant of print_string_as_lines in error path The inconsistency error path calls print_string_as_lines, which calls console_lock, which is a potentially-sleeping function and so can't be called in an atomic context. Replace calls to it with the nonblocking variant which is safe to call. Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00
Kent Overstreet	b2ffadcc7f	bcachefs: Fix scheduling while atomic from logging changes Two fixes from the recent logging changes: bch2_inconsistent() and bch2_fs_inconsistent() can be called from interrupt context, or with rcu_read_lock() held. The one syzbot found is in bch2_bkey_pick_read_device -> bch2_dev_rcu -> bch2_fs_inconsistent. We're starting to lift printbufs up to higher levels so we can emit better log messages and print them all in one go (avoid garbling), so that conversion will help with spotting these in the future; when we declare a printbuf it must be flagged if we're in an atomic context. Secondly, in btree_node_write_endio: 00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321 00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6 00085 preempt_count: 10001, expected: 0 00085 RCU nest depth: 0, expected: 0 00085 4 locks held by bch-reclaim/fa6/618: 00085 #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198 00085 #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440 00085 #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440 00085 #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130 00085 irq event stamp: 328 00085 hardirqs last enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0 00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60 00085 softirqs last enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118 00085 softirqs last disabled at (0): [<0000000000000000>] 0x0 00085 Preemption disabled at: 00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90 00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798 00085 Hardware name: linux,dummy-virt (DT) 00085 Call trace: 00085 show_stack+0x1c/0x30 (C) 00085 dump_stack_lvl+0x84/0xc0 00085 dump_stack+0x14/0x20 00085 __might_resched+0x180/0x288 00085 __might_sleep+0x4c/0x88 00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0 00085 krealloc_noprof+0x1a0/0x2d8 00085 bch2_printbuf_make_room+0x9c/0x120 00085 bch2_prt_printf+0x60/0x1b8 00085 btree_node_write_endio+0x1b0/0x2d8 00085 bio_endio+0x138/0x1f0 00085 btree_node_write_endio+0xe8/0x2d8 00085 bio_endio+0x138/0x1f0 00085 blk_update_request+0x220/0x4c0 00085 blk_mq_end_request+0x28/0x148 00085 virtblk_request_done+0x64/0xe8 00085 blk_mq_complete_request+0x34/0x40 00085 virtblk_done+0x78/0x130 00085 vring_interrupt+0x6c/0xb0 00085 __handle_irq_event_percpu+0x8c/0x2e0 00085 handle_irq_event+0x50/0xb0 00085 handle_fasteoi_irq+0xc4/0x250 00085 handle_irq_desc+0x44/0x60 00085 generic_handle_domain_irq+0x20/0x30 00085 gic_handle_irq+0x54/0xc8 00085 call_on_irq_stack+0x24/0x40 Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00
Kent Overstreet	6d77ce4a27	bcachefs: Better printing of inconsistency errors Build up and emit the error message for an inconsistency error all at once, instead of spread over multiple printk calls, so they're not jumbled in the dmesg log. Also, add better indenting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-29 13:26:13 -04:00
Kent Overstreet	7337f9f14e	bcachefs: bch2_count_fsck_err() Factor out a helper from __bch2_fsck_err(), for counting the error in the superblock and deciding whether to print or ratelimit - will be used to replace some log_fsck_err() calls, where we want to lift out printing the error message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-29 13:26:13 -04:00
Kent Overstreet	b00750c2e5	bcachefs: Better helpers for inconsistency errors An inconsistency error often happens as part of an event with multiple error messages, and we want to build up one single error message with proper indenting to produce more readable log messages that don't get garbled. Add new helpers that emit messages to a printbuf instead of printing them directly; the next patch will convert callers to use them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
Kent Overstreet	1ece53237e	bcachefs: Consistent indentation of multiline fsck errors Add the new helper printbuf_indent_add_nextline(), and use it in __bch2_fsck_err() to centralize setting the indentation of multiline fsck errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
Kent Overstreet	4fcd4de0a6	bcachefs: fs-common.c -> namei.c name <-> inode, code for managing the relationships between inodes and dirents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	981e380144	bcachefs: Kick devices out after too many write IO errors We're improving our handling of write errors - we shouldn't write degraded data just because a write failed once, we should retry it (on other devices, if possible). But for this to work, we need to kick devices out when they're only returning errors - otherwise those retries will loop infinitely. This adds a configurable timeout - if writes are failing for too long, we'll set that device read-only. In the future we should also implement more tracking and another knob for an "allowed error rate", so that we can kick out drives that are acting "unhealthy". Another thing we'll want is a mechanism (likely in userspace) for bringing a device back in after a transient error - perhaps a cable was jiggled, or there was a controller reset. After transient errors we also need a mechanism to walk (from the journal) recent btree updates that weren't flushed to that device and treat them as "degraded", since unflushed data may well not have been written. Out of scope for this patch, but becoming relevant. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	1ccbcd3205	bcachefs: bch2_write_op_error() now prints info about data update A user has been seeing the "error verifying existing checksum while rewriting existing data (memory corruption?)" error. This generally indicates a hardware issue (and that may be the case here), but it might also indicate a bug, in which case we need more information to look for patterns. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	06284963e3	bcachefs: bch2_inum_offset_err_msg_trans() no longer handles transaction restarts we're starting to use error messages with paths in fsck_errors(), where we do not want nested transaction restart handling, so let's prepare for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	45f0e6c838	bcachefs: bch2_indirect_extent_missing_error() prints path, not just inode number We want all error messages converted to print paths, not just inode numbers - users want this information, and it speeds up debugging too. Auditing and converting all error messages is going to be a big project, so for the moment we're just doing this incrementally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	60558d55f7	bcachefs: Plumb bkey_validate_context to journal_entry_validate This lets us print the exact location in the journal if it was found in the journal, or correctly print if it was found in the superblock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	1302eeb7c5	bcachefs: bkey_fsck_err now respects errors_silent Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	f7727a6767	bcachefs: bch2_inum_to_path() Add a function for walking backpointers to find a path from a given inode number, and convert various error messages to use it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	8b10590918	bcachefs: do_fsck_ask_yn() __bch2_fsck_err() is huge, and badly needs more refactoring Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	052210c3fa	bcachefs: Don't error out when logging fsck error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	9963a14da1	bcachefs: BCH_FS_recovery_running If we're autofixing topology errors, we shouldn't shut down if we're still in recovery. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	a6f4794fcd	bcachefs: struct bkey_validate_context Add a new parameter to bkey validate functions, and use it to improve invalid bkey error messages: we can now print the btree and depth it came from, or if it came from the journal, or is a btree root. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	c8e588135c	bcachefs: bch2_bucket_do_index(): inconsistent_err -> fsck_err Factor out a common helper, need_discard_or_freespace_err(), which is now used by both fsck and the runtime checks, and can repair. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	eb73e7773f	bcachefs: Kill FSCK_NEED_FSCK If we find an error that indicates that we need to run fsck, we can specify that directly with run_explicit_recovery_pass(). These are now log_fsck_err() calls: we're just logging in the superblock that an error occurred - and possibly doing an emergency shutdown, depending on policy. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	eb5db64c45	bcachefs: Fix __bch2_fsck_err() warning We only warn about having a btree_trans that wasn't passed in if we'll be prompting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 16:50:14 -04:00
Kent Overstreet	492e24d760	bcachefs: Make sure we print error that causes fsck to bail out Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	658c82f41e	bcachefs: bkey errors are only AUTOFIX during read Newly generated keys, in the transaction commit path or write path, should not be AUTOFIX; those indicate bugs that we need to fail fast for. Fixes: `5612daafb7` ("bcachefs: Fix fsck warnings from bkey validation") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	5612daafb7	bcachefs: Fix fsck warnings from bkey validation __bch2_fsck_err() warns if the current task has a btree_trans object and it wasn't passed in, because if it has to prompt for user input it has to be able to unlock it. But plumbing the btree_trans through bkey_validate(), as well as transaction restarts, is problematic - so instead make bkey fsck errors FSCK_AUTOFIX, which doesn't need to warn. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	d97de0d017	bcachefs: Make bkey_fsck_err() a wrapper around fsck_err() bkey_fsck_err() was added as an interface that looks like fsck_err(), but previously all it did was ensure that the appropriate error counter was incremented in the superblock. This is a cleanup and bugfix patch that converts it to a wrapper around fsck_err(). This is needed to fix an issue with the upgrade path to disk_accounting_v3, where the "silent fix" error list now includes bkey_fsck errors; fsck_err() handles this in a unified way, and since we need to change printing of bkey fsck errors from the caller to the inner bkey_fsck_err() calls, this ends up being a pretty big change. Also, rename .invalid() methods to .validate(), for clarity, while we're changing the function signature anyways (to drop the printbuf argument). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	889fb3dc5d	bcachefs: Unlock trans when waiting for user input in fsck We can't hold locks while waiting for user input, that's a deadlock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-07-14 19:00:14 -04:00
Kent Overstreet	0c97c437e3	bcachefs: twf: convert bch2_stdio_redirect_readline() to darray We now read the line from the buffer atomically, which means we have to allow the buffer to grow past STDIO_REDIRECT_BUFSIZE if we're waiting for a full line - this behaviour is necessary for stdio_redirect_readline_timeout() in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-07-14 19:00:14 -04:00
Kent Overstreet	a850bde649	bcachefs: fsck_err() may now take a btree_trans fsck_err() now optionally takes a btree_trans; if the current thread has one, it is required that it be passed. The next patch will use this to unlock when waiting for user input. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-07-14 19:00:14 -04:00
Kent Overstreet	e76a2b65b0	bcachefs: add might_sleep() annotations for fsck_err() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-07-14 19:00:12 -04:00
Kent Overstreet	33dfafa902	bcachefs: Fix safe errors by default i.e. the start of automatic self healing: If errors=continue or fix_safe, we now automatically fix simple errors without user intervention. New error action option: fix_safe. This replaces the existing errors=ro option, which gets a new slot, i.e. existing errors=ro users now get errors=fix_safe. This is currently only enabled for a limited set of errors - initially just disk accounting; errors we would never not want to fix, and we don't want to require user intervention (i.e. to make sure a bug report gets filed). Errors will still be counted in the superblock, so we (developers) will still know they've been occurring if a bug report gets filed (as bug reports typically include the errors superblock section). Eventually we'll be enabling this for a much wider set of errors, after we've done thorough error injection testing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-06-20 09:13:09 -04:00
Kent Overstreet	19391b9294	bcachefs: allow for custom action in fsck error messages Be more explicit to the user about what we're doing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	b3c7fd35c0	bcachefs: On emergency shutdown, print out current journal sequence number Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-04-01 01:07:24 -04:00
Kent Overstreet	d2554263ad	bcachefs: Split out recovery_passes.c We've grown a fair amount of code for managing recovery passes; tracking which ones we're running, which ones need to be run, and flagging in the superblock which ones need to be run on the next recovery. So it's worth splitting out into its own file, this code is pretty different from the code in recovery.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-03-31 20:36:11 -04:00
Kent Overstreet	52946d828a	bcachefs: Kill more -EIO error codes This converts -EIOs related to btree node errors to private error codes, which will help with some ongoing debugging by giving us better error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-03-13 21:22:23 -04:00
Kent Overstreet	a64a37338d	bcachefs: Don't autofix errors we can't fix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-01-05 23:24:20 -05:00
Kent Overstreet	d55ddf6e7a	bcachefs: Online fsck can now fix errors BCH_FS_fsck_done -> BCH_FS_fsck_running; set when we might be fixing fsck errors. Also, set fix_errors to ask by default when fsck is running. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-01-05 23:24:20 -05:00
Kent Overstreet	96f37eabe7	bcachefs: factor out thread_with_file, thread_with_stdio thread_with_stdio now knows how to handle input - fsck can now prompt to fix errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-01-05 23:24:19 -05:00
Kent Overstreet	3c471b6588	bcachefs: convert bch_fs_flags to x-macro Now we can print out filesystem flags in sysfs, useful for debugging various "what's my filesystem doing" issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-01-01 11:47:38 -05:00
Kent Overstreet	8b16413cda	bcachefs: bch_sb.recovery_passes_required Add two new superblock fields. Since the main section of the superblock is now full, we have to add a new variable length section for them, bch_sb_field_ext: recovery_passes_required (recovery passes that must be run on the next mount) and errors_silent (errors that will be silently fixed). These are to improve upgrading and downgrading: these fields won't be cleared until after recovery successfully completes, so there won't be any issues with crashing partway through an upgrade or a downgrade. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-01-01 11:47:07 -05:00
Kent Overstreet	b65db750e2	bcachefs: Enumerate fsck errors This patch adds a superblock error counter for every distinct fsck error; this means that when analyzing filesystems out in the wild we'll be able to see what sorts of inconsistencies are being found and repaired, and hence what bugs to look for. Errors validating bkeys are not yet considered distinct fsck errors, but this patch adds a new helper, bkey_fsck_err(), in order to add distinct error types for them as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-01 21:11:08 -04:00
Kent Overstreet	f5d26fa31e	bcachefs: bch_sb_field_errors Add a new superblock section to keep counts of errors seen since filesystem creation: we'll be adding counters for every distinct fsck error. The new superblock section has entries of the form [ id, count, time_of_last_error ]; this is intended to let us see what errors are occurring - and getting fixed - via show-super output. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-01 21:11:08 -04:00
Kent Overstreet	94119eeb02	bcachefs: Add IO error counts to bch_member We now track IO errors per device since filesystem creation. IO error counts can be viewed in sysfs, or with the 'bcachefs show-super' command. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-01 21:11:08 -04:00
Kent Overstreet	1809b8cba7	bcachefs: Break up io.c More reorganization: this splits up io.c into io_read.c, io_misc.c (fallocate, fpunch, truncate) and io_write.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:12 -04:00
Kent Overstreet	922bc5a037	bcachefs: Make topology repair a normal recovery pass This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and explicitly running a specific recovery pass - this is a more general replacement for how we were running topology repair before. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:08 -04:00
Kent Overstreet	a0f8faea5f	bcachefs: fix_errors option is now a proper enum Before, it was parsed as a bool but internally it was really an enum: this lets us pass in all the possible values. But we special case the option parsing: no supplied value is parsed as FSCK_FIX_yes, to match the previous behaviour. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:07 -04:00
Kent Overstreet	4f2c166ebe	bcachefs: Fix bch2_fsck_ask_yn() - getline() output includes a newline, without stripping that we were just looping - Make the prompt clearer Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:03 -04:00
Kent Overstreet	c8d5b71411	bcachefs: Make sure hash info gets initialized in fsck We had some bugs with setting/using first_this_inode in the inode walker in the dirents/xattr code. This patch changes to not clear first_this_inode until after initializing the new hash info. Also, we fix an error message to not print on transaction restart, and add a comment to related fsck error code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
Kent Overstreet	853b7393c2	bcachefs: Allow answering y or n to all fsck errors of given type This changes the ask_yn() function used by fsck to accept Y or N, meaning yes or no for all errors of a given type. With this, the user can be prompted only for distinct error types - useful when a filesystem has lots of errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:00 -04:00
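Commits f5d26fa31e, 7337f9f14e and 417f01e726 describe a per-error-type counter kept in the superblock (entries of the form [ id, count, time_of_last_error ]) together with a decision on whether each new occurrence should be printed or ratelimited. The C sketch below illustrates only that idea. The error ids, the ERR_RATELIMIT_NR threshold, enum err_print and count_err() are hypothetical and are not the bcachefs implementation of bch2_count_fsck_err().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical error ids; not the real bcachefs error list. */
enum err_id { ERR_bkey_invalid, ERR_dirent_bad_hash, NR_ERR_IDS };

/* One entry per error type: [ id, count, time_of_last_error ]. */
struct err_entry {
	uint32_t id;
	uint64_t nr;
	uint64_t last_error_time;
};

/* Hypothetical policy: print the first N occurrences, then go quiet. */
#define ERR_RATELIMIT_NR 10

enum err_print { ERR_PRINT, ERR_PRINT_LAST, ERR_SILENT };

static struct err_entry err_table[NR_ERR_IDS];

/* Count one occurrence of an error and decide whether to print it. */
static enum err_print count_err(enum err_id id, uint64_t now)
{
	assert(id < NR_ERR_IDS);
	struct err_entry *e = &err_table[id];

	e->id = id;
	e->nr++;			/* always counted, even when silent */
	e->last_error_time = now;

	if (e->nr < ERR_RATELIMIT_NR)
		return ERR_PRINT;
	if (e->nr == ERR_RATELIMIT_NR)
		return ERR_PRINT_LAST;	/* caller notes further ones are ratelimited */
	return ERR_SILENT;
}
```

Counting happens unconditionally, so the superblock totals stay accurate even while printing is suppressed; only the log output is limited.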

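Commits 853b7393c2 and 4f2c166ebe describe the fsck prompt: a line read with getline() still carries its trailing newline, which must be stripped before comparing, and an uppercase Y or N answers yes or no for every remaining error of the same type. A minimal user-space sketch of that parsing, assuming the hypothetical names enum ask_yn, parse_yn_response() and resolve_answer() (not taken from the kernel source):

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical answer codes; YN_INVALID means "ask again". */
enum ask_yn { YN_INVALID = -1, YN_NO, YN_YES, YN_ALLNO, YN_ALLYES };

/*
 * Parse one line of user input. getline() leaves the trailing '\n' in
 * the buffer, so strip trailing whitespace first: otherwise "y\n" never
 * matches and the prompt loops forever.
 */
static enum ask_yn parse_yn_response(char *buf)
{
	assert(buf != NULL);

	size_t len = strlen(buf);
	while (len && isspace((unsigned char) buf[len - 1]))
		buf[--len] = '\0';

	if (len != 1)
		return YN_INVALID;

	switch (buf[0]) {
	case 'n': return YN_NO;
	case 'y': return YN_YES;
	case 'N': return YN_ALLNO;	/* no, for every error of this type */
	case 'Y': return YN_ALLYES;	/* yes, for every error of this type */
	default:  return YN_INVALID;
	}
}

/*
 * Apply an answer for one error type. An uppercase answer is remembered
 * in *sticky so the caller can skip prompting for later errors of the
 * same type. Returns true if this error should be fixed.
 */
static bool resolve_answer(enum ask_yn *sticky, enum ask_yn answer)
{
	if (answer == YN_ALLNO || answer == YN_ALLYES)
		*sticky = answer;
	return answer == YN_YES || answer == YN_ALLYES;
}
```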

68 Commits
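Commit 981e380144 describes kicking a device out once its writes have been failing for longer than a configurable timeout, so that retrying failed writes on other devices cannot loop forever. A minimal sketch of that bookkeeping, assuming a hypothetical struct dev_write_health, a caller-supplied monotonic clock where 0 is never a valid time, and a timeout passed in by the caller (none of these are bcachefs's actual names or types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; not bcachefs's struct bch_dev. */
struct dev_write_health {
	uint64_t failing_since;	/* start of current error run; 0 = healthy */
	bool	 read_only;
};

/*
 * Record the outcome of one write. A success ends the error run; a
 * failure starts one (if none is in progress) and, once writes have
 * been failing for at least `timeout`, marks the device read-only so
 * callers stop retrying on it.
 */
static void dev_write_done(struct dev_write_health *d, bool ok,
			   uint64_t now, uint64_t timeout)
{
	assert(now != 0);	/* 0 is reserved for "not failing" */

	if (ok) {
		d->failing_since = 0;
		return;
	}

	if (!d->failing_since)
		d->failing_since = now;

	if (now - d->failing_since >= timeout)
		d->read_only = true;
}
```

Resetting failing_since on any successful write means only an uninterrupted run of failures counts toward the timeout; the commit message leaves an "allowed error rate" knob for future work.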