This slightly changes how trans->journal_res works, in preparation for
changing the btree write buffer flush path to use it.
Now, BTREE_INSERT_JOURNAL_REPLAY means "don't take a journal
reservation; trans->journal_res.seq already refers to the journal
sequence number to pin".
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The upcoming btree write buffer rework is going to use the journal
itself as the first stage of the write buffer; this is a cleanup to make
sure k->needs_whiteout is initialized before keys hit the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces a new helper for connecting time_stats to state changes,
i.e. when taking journal reservations is blocked for some reason.
We use this to track separately the different reasons the journal might
be blocked - i.e. space in the journal full, or the journal pin fifo
full.
Also do some cleanup and improvements on the time stats code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, bch2_journal_pin_set() would silently ignore a request to
pin a journal sequence number that was no longer dirty, because it was
used internally by bch2_journal_pin_copy() which could race with the src
pin being flushed.
Split these apart so that we can properly assert that @seq is a
currently dirty journal sequence number - this is almost always a bug.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In an ideal world, we'd have a common helper that could be used for
sorting a list of inodes into the correct lock order, and then the same
lock ordering could be used for any type of inode lock, not just
i_rwsem.
But the lock ordering rules for i_rwsem are a bit complicated, so -
abandon that dream for now and do it the more standard way.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also log time waiting for c->writes references to be dropped; this will
help in debugging why unmounts are taking longer than they should.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's confusing if we run fsck a second time (in debug mode, to verify
the second run is clean), but errors are still ratelimited from the
first run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add checks to all the VFS paths for "are we in a RO snapshot?".
Note - we don't check this when setting inode options via our xattr
interface, since those generally only affect data placement, not
contents of data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
Add a new superblock section that contains a list of
{ minor version, recovery passes, errors_to_fix }
that is - a list of recovery passes that must be run when downgrading
past a given version, and a list of errors to silently fix.
The upcoming disk accounting rewrite is not going to be fully
compatible: we're going to have to regenerate accounting both when
upgrading to the new version, and also from downgrading from the new
version, since the new method of doing disk space accounting is a
completely different architecture based on deltas, and synchronizing
them for every jounal entry write to maintain compatibility is going to
be too expensive and impractical.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new superblock fields. Since the main section of the superblock
is now fully, we have to add a new variable length section for them -
bch_sb_field_ext.
- recovery_passes_requried: recovery passes that must be run on the
next mount
- errors_silent: errors that will be silently fixed
These are to improve upgrading and dwongrading: these fields won't be
cleared until after recovery successfully completes, so there won't be
any issues with crashing partway through an upgrade or a downgrade.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch will start to refer to recovery passes from the
superblock; naturally, we now need identifiers that don't change, since
the existing enum is in the order in which they are run and is not
fixed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_REPLICAS_MAX isn't the actual maximum number of pointers in an
extent, it's the maximum number of dirty pointers.
We don't have a real restriction on the number of cached pointers, and
we don't want a fixed size array here anyways - so switch to
DARRAY_PREALLOCATED().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-and-tested-by: Daniel J Blueman <daniel@quora.org>
We sometimes use darrays for quite large buffers - the btree write
buffer in particular needs large buffers, since it must be sized to hold
all the write buffer keys outstanding in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Move the slowpath (actually growing the darray) to an out-of-line
function; also, add some helpers for the upcoming btree write buffer
rewrite.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a superblock write hasn't happened (i.e. we never had to go rw), then
c->sb.version will be out of date w.r.t. c->disk_sb.sb->version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
turns out iterate_iovec() mutates __iov, we need to save our own copy
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: Marcin Mirosław <marcin@mejor.pl>
peek_upto() checks against the end position and bails out before
FILTER_SNAPSHOTS checks; this is because if we end up at a different
inode number than the original search key none of the keys we see might
be visibile in the current snapshot - we might be looking at inode in a
completely different subvolume.
But this is broken, because when we're iterating over extents we're
checking against the extent start position to decide when to bail out,
and the extent start position isn't monotonically increasing until after
we've run FILTER_SNAPSHOTS.
Fix this by adding a simple inode number check where the old bailout
check was, and moving the main check to the correct position.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
Pull tracing fixes from Steven Rostedt:
- Fix readers that are blocked on the ring buffer when buffer_percent
is 100%. They are supposed to wake up when the buffer is full, but
because the sub-buffer that the writer is on is never considered
"dirty" in the calculation, dirty pages will never equal nr_pages.
Add +1 to the dirty count in order to count for the sub-buffer that
the writer is on.
- When a reader is blocked on the "snapshot_raw" file, it is to be
woken up when a snapshot is done and be able to read the snapshot
buffer. But because the snapshot swaps the buffers (the main one with
the snapshot one), and the snapshot reader is waiting on the old
snapshot buffer, it was not woken up (because it is now on the main
buffer after the swap). Worse yet, when it reads the buffer after a
snapshot, it's not reading the snapshot buffer, it's reading the live
active main buffer.
Fix this by forcing a wakeup of all readers on the snapshot buffer
when a new snapshot happens, and then update the buffer that the
reader is reading to be back on the snapshot buffer.
- Fix the modification of the direct_function hash. There was a race
when new functions were added to the direct_function hash as when it
moved function entries from the old hash to the new one, a direct
function trace could be hit and not see its entry.
This is fixed by allocating the new hash, copy all the old entries
onto it as well as the new entries, and then use rcu_assign_pointer()
to update the new direct_function hash with it.
This also fixes a memory leak in that code.
- Fix eventfs ownership
* tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Fix modification of direct_function hash while in use
tracing: Fix blocked reader of snapshot buffer
ring-buffer: Fix wake ups when buffer_percent is set to 100
eventfs: Fix file and directory uid and gid ownership
osq_wait_next() is passed 'prev' from osq_lock() and NULL from
osq_unlock() but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
Should have no effect on the generated code since gcc manages to assume
that 'prev != NULL' due to an earlier dereference.
Signed-off-by: David Laight <david.laight@aculab.com>
[ Changed 'old' to 'old_cpu' by request from Waiman Long - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries are added is large enough to cause two
allocations in the loop:
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
if (!new)
goto out_remove;
entry->direct = addr;
}
}
Where ftrace_add_rec_direct() has:
if (ftrace_hash_empty(direct_functions) ||
direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
struct ftrace_hash *new_hash;
int size = ftrace_hash_empty(direct_functions) ? 0 :
direct_functions->count + 1;
if (size < 32)
size = 32;
new_hash = dup_hash(direct_functions, size);
if (!new_hash)
return NULL;
*free_hash = direct_functions;
direct_functions = new_hash;
}
The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.
But this also exposed a more serious bug.
The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:
new_hash = dup_hash(direct_functions, size);
and
direct_functions = new_hash;
can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.
That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.
Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handles
this.
The following is done:
1. Change add_hash_entry() to return the entry it created and inserted
into the hash, and not just return success or not.
2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
the former.
3. Allocate a "new_hash" at the start that is made for holding both the
new hash entries as well as the existing entries in direct_functions.
4. Copy (not move) the direct_function entries over to the new_hash.
5. Copy the entries of the added hash to the new_hash.
6. If everything succeeds, then use rcu_pointer_assign() to update the
direct_functions with the new_hash.
This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull gpio fixes from Bartosz Golaszewski:
- Andy steps down as GPIO reviewer
- Kent becomes a reviewer for GPIO uAPI
- add missing intel file to the relevant MAINTAINERS section
* tag 'gpio-fixes-for-v6.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Add a missing file to the INTEL GPIO section
MAINTAINERS: Remove Andy from GPIO maintainers
MAINTAINERS: split out the uAPI into a new section
Pull block fixes from Jens Axboe:
"Fix for a badly numbered flag, and a regression fix for the badblocks
updates from this merge window"
* tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux:
block: renumber QUEUE_FLAG_HW_WC
badblocks: avoid checking invalid range in badblocks_check()
If an application blocks on the snapshot or snapshot_raw files, expecting
to be woken up when a snapshot occurs, it will not happen. Or it may
happen with an unexpected result.
That result is that the application will be reading the main buffer
instead of the snapshot buffer. That is because when the snapshot occurs,
the main and snapshot buffers are swapped. But the reader has a descriptor
still pointing to the buffer that it originally connected to.
This is fine for the main buffer readers, as they may be blocked waiting
for a watermark to be hit, and when a snapshot occurs, the data that the
main readers want is now on the snapshot buffer.
But for waiters of the snapshot buffer, they are waiting for an event to
occur that will trigger the snapshot and they can then consume it quickly
to save the snapshot before the next snapshot occurs. But to do this, they
need to read the new snapshot buffer, not the old one that is now
receiving new data.
Also, it does not make sense to have a watermark "buffer_percent" on the
snapshot buffer, as the snapshot buffer is static and does not receive new
data except all at once.
Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: debdd57f51 ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.
0 - is to wait until any data is in the buffer
1 - is to wait for 1% of the sub buffers to be filled
50 - would be half of the sub buffers are filled with data
100 - is not to wake the waiter until the ring buffer is completely full
Unfortunately the test for being full was:
dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
return (dirty * 100) > (full * nr_pages);
Where "full" is the value for "buffer_percent".
There is two issues with the above when full == 100.
1. dirty * 100 > 100 * nr_pages will never be true
That is, the above is basically saying that if the user sets
buffer_percent to 100, more pages need to be dirty than exist in the
ring buffer!
2. The page that the writer is on is never considered dirty, as dirty
pages are only those that are full. When the writer goes to a new
sub-buffer, it clears the contents of that sub-buffer.
That is, even if the check was ">=" it would still not be equal as the
most pages that can be considered "dirty" is nr_pages - 1.
To fix this, add one to dirty and use ">=" in the compare.
Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 03329f9939 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Commit 804951203a ("platform/x86:intel/pmc: Combine core_init() and
core_configure()") caused a network performance regression due to the GBE
LTR ignore that it added at probe. This was needed in order to allow the
SoC to enter the deepest Package C state. To fix the regression and at
least support PC10 during suspend, move the LTR ignore from probe to the
suspend callback, and enable it again on resume. This solution will allow
PC10 during suspend but restrict Package C entry at runtime to no deeper
than PC8/9 while a network cable it attach to the PCH LAN.
Fixes: 804951203a ("platform/x86:intel/pmc: Combine core_init() and core_configure()")
Signed-off-by: "David E. Box" <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20231223032548.1680738-6-david.e.box@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Commit 804951203a ("platform/x86:intel/pmc: Combine core_init() and
core_configure()") caused a network performance regression due to the GBE
LTR ignore that it added during probe. The fix will move the ignore to
occur at suspend-time (so as to not affect suspend power). This will
require the ability to enable the LTR again on resume. Modify
pmc_core_send_ltr_ignore() to allow enabling an LTR.
Fixes: 804951203a ("platform/x86:intel/pmc: Combine core_init() and core_configure()")
Signed-off-by: "David E. Box" <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20231223032548.1680738-5-david.e.box@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Add a suspend callback to struct pmc for performing platform specific tasks
before device suspend. This is needed in order to perform GBE LTR ignore on
certain platforms at suspend-time instead of at probe-time and replace the
GBE LTR ignore removal that was done in order to fix a bug introduced by
commit 804951203a ("platform/x86:intel/pmc: Combine core_init() and
core_configure()").
Fixes: 804951203a ("platform/x86:intel/pmc: Combine core_init() and core_configure()")
Signed-off-by: "David E. Box" <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20231223032548.1680738-4-david.e.box@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Pull ksmbd server fix from Steve French:
- address possible slab out of bounds in parsing of open requests
* tag '6.7rc7-smb3-srv-fix' of git://git.samba.org/ksmbd:
ksmbd: fix slab-out-of-bounds in smb_strndup_from_utf16()
Pull Kbuild fixes from Masahiro Yamada:
- Revive proper alignment for the ksymtab and kcrctab sections
- Fix gen_compile_commands.py tool to resolve symbolic links
- Fix symbolic links to installed debug VDSO files
- Update MAINTAINERS
* tag 'kbuild-fixes-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
linux/export: Ensure natural alignment of kcrctab array
kbuild: fix build ID symlinks to installed debug VDSO files
gen_compile_commands.py: fix path resolve with symlinks in it
MAINTAINERS: Add scripts/clang-tools to Kbuild section
linux/export: Fix alignment for 64-bit ksymtab entries
Pull bcachefs fixes from Kent Overstreet:
"Just a few fixes: besides a few one liners, we have a fix for
snapshots + compression where the extent update path didn't account
for the fact that with snapshots, we might split an existing extent
into three, not just two; and a small fixup for promotes which were
broken by the recent changes in the data update path to correctly take
into account device durability"
* tag 'bcachefs-2023-12-27' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix promotes
bcachefs: Fix leakage of internal error code
bcachefs: Fix insufficient disk reservation with compression + snapshots
bcachefs: fix BCH_FSCK_ERR enum
The ___kcrctab section holds an array of 32-bit CRC values.
Add a .balign 4 to tell the linker the correct memory alignment.
Fixes: f3304ecd7f ("linux/export: use inline assembler to populate symbol CRCs")
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
If ->NameOffset/Length is bigger than ->CreateContextsOffset/Length,
ksmbd_check_message doesn't validate request buffer it correctly.
So slab-out-of-bounds warning from calling smb_strndup_from_utf16()
in smb2_open() could happen. If ->NameLength is non-zero, Set the larger
of the two sums (Name and CreateContext size) as the offset and length of
the data area.
Reported-by: Yang Chaoming <lometsj@live.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull misc fixes from Andrew Morton:
"11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues
or are not considered backporting material"
* tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add an old address for Naoya Horiguchi
mm/memory-failure: cast index to loff_t before shifting it
mm/memory-failure: check the mapcount of the precise page
mm/memory-failure: pass the folio and the page to collect_procs()
selftests: secretmem: floor the memory size to the multiple of page_size
mm: migrate high-order folios in swap cache correctly
maple_tree: do not preallocate nodes for slot stores
mm/filemap: avoid buffered read/write race to read inconsistent data
kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset
kexec: select CRYPTO from KEXEC_FILE instead of depending on it
kexec: fix KEXEC_FILE dependencies
When gpio-tangier was split the new born headers had been missed
in the MAINTAINERS. Add it there.
Fixes: d2c19e89e0 ("gpio: tangier: Introduce Intel Tangier GPIO driver")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Kent Gibson is the author of the character device uAPI v2 and should be
Cc'ed on all patches aimed for it. Unfortunately this is not the case as
he's not listed in MAINTAINERS. Split the uAPI files into their own
section and make Kent the reviewer.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>