reading the journal can take a decent amount of time compared to the
rest of fsck, let's only read it when required.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- fix assorted (harmless) off-by-one errors
- we were inconsistent on whether out->pos stays <= out->size on
overflow; now it does, and printbuf.overflow exists to indicate if a
printbuf has overflowed
- factor out printbuf_advance_pos()
- printbuf_nul_terminate_reserved(); use this to reduce the number of
printbuf_make_room() calls
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When a superblock write is silently dropped or it's been modified by
another process we need to know which device it was.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Btree nodes are log structured; thus, we need to emit whiteouts when
we're deleting a key that's been written out to disk.
k->needs_whiteout tracks whether a key will need a whiteout when it's
deleted, and this requires some careful handling; e.g. the key we're
deleting may not have been written out to disk, but it may have
overwritten a key that was - thus we need to carry this flag around on
overwrites.
Invariants:
There may be multiple key for the same position in a given node (because
of overwrites), but only one of them will be a live (non deleted) key,
and only one key for a given position will have the needs_whiteout flag
set.
Additionally, we don't want to carry around whiteouts that need to be
written in the main searchable part of a btree node - btree_iter_peek()
will have to skip past them, and this can lead to an O(n^2) issues when
doing sequential deletions (e.g. inode rm/truncate). So there's a
separate region in the btree node buffer for unwritten whiteouts; these
are merge sorted with the rest of the keys we're writing in the btree
node write path.
The unwritten whiteouts was a later optimization that bch2_sort_keys()
didn't take into account; the unwritten whiteouts area means that we
never have deleted keys with needs_whiteout set in the main searchable
part of a btree node.
That means we can simplify and optimize some sort paths, and eliminate
an assertion that syzbot found:
- Unless we're in the btree node write path, it's always ok to drop
whiteouts when sorting
- When sorting for a btree node write, we drop the whiteout if it's not
from the unwritten whiteouts area, or if it's overwritten by a real
key at the same position.
This completely eliminates some tricky logic for propagating the
needs_whiteout flag: syzbot was able to hit the assertion that checked
that there shouldn't be more than one key at the same pos with
needs_whiteout set, likely due to a combination of flipping on
needs_whiteout on all written keys (they need whiteouts if overwritten),
combined with not always dropping unneeded whiteouts, and the tricky
logic in the sort path for preserving needs_whiteout that wasn't really
needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_write_super() was looping over online devices multiple times -
dropping and retaking io_ref each time.
This meant it could race with device removal; it could increment the
sequence number on a device but fail to write it - and then if the
device was re-added, it would get confused the next time around thinking
a superblock write was silently dropped.
Fix this by taking io_ref once, and stashing pointers to online devices
in a darray.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ancient versions of bcachefs produced packed formats that could
represent keys that our in memory format cannot represent;
bformat_needs_redo() has some tricky shifts to check for this sort of
overflow.
Reported-by: syzbot+594427aebfefeebe91c6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
filefrag (and potentially other utilities that call fiemap) sometimes
pass ULONG_MAX as the length. fiemap_prep clamps excessively large
lengths - but the calculation of end can overflow if it occurs before
calling fiemap_prep. When this happens, filefrag assumes it has read to
the end and exits.
Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The bucket_gens array is a single array allocation (one byte per
bucket), and kernel allocations are still limited to INT_MAX.
Check this limit to avoid failing the bucket_gens array allocation.
Reported-by: syzbot+b29f436493184ea42e2b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_get_next_dev() and bch2_get_next_online_dev() iterate over devices,
dropping and taking refs as they go; we can't access the previous device
(for ca->dev_idx) after we've dropped our ref to it, unless we take
rcu_read_lock() first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_dev_lookup() is supposed to take a ref on the device it returns, but
for_each_member_device() takes refs as it iterates,
for_each_member_device_rcu() does not.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using mutex_lock() inside a wait_event() conditional -
prepare_to_wait() has already flipped task state, so potentially
blocking ops need annotation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pull scheduler fixes from Ingo Molnar:
- Fix EEVDF corner cases
- Fix two nohz_full= related bugs that can cause boot crashes
and warnings
* tag 'sched-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU
sched/isolation: Prevent boot crash when the boot CPU is nohz_full
sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf()
sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr
sched/eevdf: Always update V if se->on_rq when reweighting
Pull x86 fixes from Ingo Molnar:
- Make the CPU_MITIGATIONS=n interaction with conflicting
mitigation-enabling boot parameters a bit saner.
- Re-enable CPU mitigations by default on non-x86
- Fix TDX shared bit propagation on mprotect()
- Fix potential show_regs() system hang when PKE initialization
is not fully finished yet.
- Add the 0x10-0x1f model IDs to the Zen5 range
- Harden #VC instruction emulation some more
* tag 'x86-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n
cpu: Re-enable CPU mitigations by default for !X86 architectures
x86/tdx: Preserve shared bit on mprotect()
x86/cpu: Fix check for RDPKRU in __show_regs()
x86/CPU/AMD: Add models 0x10-0x1f to the Zen5 range
x86/sev: Check for MWAITX and MONITORX opcodes in the #VC handler
Pull irq fix from Ingo Molnar:
"Fix a double free bug in the init error path of the GICv3 irqchip
driver"
* tag 'irq-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Prevent double free on error
housekeeping_setup() checks cpumask_intersects(present, online) to ensure
that the kernel will have at least one housekeeping CPU after smp_init(),
but this doesn't work if the maxcpus= kernel parameter limits the number of
processors available after bootup.
For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at
boot time on a virtual machine with 4 CPUs.
Change housekeeping_setup() to use cpumask_first_and() and check that the
returned CPU number is valid and less than setup_max_cpus.
Another corner case is "nohz_full=0" on a machine with a single CPU or with
the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty
and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the
WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().
And how should the kernel interpret the "nohz_full=" parameter? It should
be silently ignored, but currently cpulist_parse() happily returns the
empty cpumask and this leads to the same problem.
Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
and do nothing in this case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240413141746.GA10008@redhat.com
Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
include the boot CPU, which is no longer true after:
08ae95f4fd ("nohz_full: Allow the boot CPU to be nohz_full").
However after:
aae17ebb53 ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
the kernel will crash at boot time in this case; housekeeping_any_cpu()
returns an invalid CPU number until smp_init() brings the first
housekeeping CPU up.
Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
return smp_processor_id() in this case.
This is just the simple and backportable workaround which fixes the
symptom, but smp_processor_id() at boot time should be safe at least for
type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu
logic.
There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to
offline tick_do_timer_cpu (the 1st online housekeeping CPU).
Fixes: aae17ebb53 ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
Reported-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240411143905.GA19288@redhat.com
Closes: https://lore.kernel.org/all/20240402105847.GA24832@redhat.com/
Pull Rust fixes from Miguel Ojeda:
- Soundness: make internal functions generated by the 'module!' macro
inaccessible, do not implement 'Zeroable' for 'Infallible' and
require 'Send' for the 'Module' trait.
- Build: avoid errors with "empty" files and workaround 'rustdoc' ICE.
- Kconfig: depend on '!CFI_CLANG' and avoid selecting 'CONSTRUCTORS'.
- Code docs: remove non-existing key from 'module!' macro example.
- Docs: trivial rendering fix in arch table.
* tag 'rust-fixes-6.9' of https://github.com/Rust-for-Linux/linux:
rust: remove `params` from `module` macro example
kbuild: rust: force `alloc` extern to allow "empty" Rust files
kbuild: rust: remove unneeded `@rustc_cfg` to avoid ICE
rust: kernel: require `Send` for `Module` implementations
rust: phy: implement `Send` for `Registration`
rust: make mutually exclusive with CFI_CLANG
rust: macros: fix soundness issue in `module!` macro
rust: init: remove impl Zeroable for Infallible
docs: rust: fix improper rendering in Arch Support page
rust: don't select CONSTRUCTORS
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for TASK_SIZE on rv64/NOMMU, to reflect the lack of user/kernel
separation
- A fix to avoid loading rv64/NOMMU kernel past the start of RAM
- A fix for RISCV_HWPROBE_EXT_ZVFHMIN on ilp32 to avoid signed integer
overflow in the bitmask
- The sud_test kselftest has been fixed to properly swizzle the syscall
number into the return register, which are not the same on RISC-V
- A fix for a build warning in the perf tools on rv32
- A fix for the CBO selftests, to avoid non-constants leaking into the
inline asm
- A pair of fixes for T-Head PBMT errata probing, which has been
renamed MAE by the vendor
* tag 'riscv-for-linus-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: selftests: cbo: Ensure asm operands match constraints, take 2
perf riscv: Fix the warning due to the incompatible type
riscv: T-Head: Test availability bit before enabling MAE errata
riscv: thead: Rename T-Head PBMT to MAE
selftests: sud_test: return correct emulated syscall value on RISC-V
riscv: hwprobe: fix invalid sign extension for RISCV_HWPROBE_EXT_ZVFHMIN
riscv: Fix loading 64-bit NOMMU kernels past the start of RAM
riscv: Fix TASK_SIZE on 64-bit NOMMU
Pull smb client fixes from Steve French:
"Three smb3 client fixes, all also for stable:
- two small locking fixes spotted by Coverity
- FILE_ALL_INFO and network_open_info packing fix"
* tag '6.9-rc5-cifs-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix lock ordering potential deadlock in cifs_sync_mid_result
smb3: missing lock when picking channel
smb: client: Fix struct_group() usage in __packed structs
Pull i2c fixes from Wolfram Sang:
"Fix a race condition in the at24 eeprom handler, a NULL pointer
exception in the I2C core for controllers only using target modes,
drop a MAINTAINERS entry, and fix an incorrect DT binding for at24"
* tag 'i2c-for-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: smbus: fix NULL function pointer dereference
MAINTAINERS: Drop entry for PCA9541 bus master selector
eeprom: at24: fix memory corruption race condition
dt-bindings: eeprom: at24: Fix ST M24C64-D compatible schema