Commit Graph

45240 Commits

Author SHA1 Message Date
Thomas Gleixner
ecb5e1aa82 printk: nbcon: Implement emergency sections
In emergency situations (something has gone wrong but the
system continues to operate), usually important information
(such as a backtrace) is generated via printk(). This
information should be pushed out to the consoles ASAP.

Add per-CPU emergency nesting tracking because an emergency
can arise while in an emergency situation.

Add functions to mark the beginning and end of emergency
sections where the urgent messages are generated.

Perform direct console flushing at the emergency priority if
the current CPU is in an emergency state and it is safe to do
so.

Note that the emergency state is not system-wide. While one CPU
is in an emergency state, another CPU may attempt to print
console messages at normal priority.

Also note that printk() already attempts to flush consoles in
the caller context for normal priority. However, follow-up
changes will introduce printing kthreads, in which case the
normal priority printk() calls will offload to the kthreads.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-32-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 15:03:04 +02:00
John Ogness
6690d6b527 printk: Add helper for flush type logic
There are many call sites where console flushing occur.
Depending on the system state and types of consoles, the flush
methods to use are different. A flush call site generally must
consider:

	@have_boot_console
	@have_nbcon_console
	@have_legacy_console
	@legacy_allow_panic_sync
	is_printk_preferred()

and take into account the current CPU state:

	NBCON_PRIO_NORMAL
	NBCON_PRIO_EMERGENCY
	NBCON_PRIO_PANIC

in order to decide if it should:

	flush nbcon directly via atomic_write() callback
	flush legacy directly via console_unlock
	flush legacy via offload to irq_work

All of these call sites use their own logic to make this
decision, which is complicated and error prone. Especially
later when two more flush methods will be introduced:

	flush nbcon via offload to kthread
	flush legacy via offload to kthread

Introduce a new internal struct console_flush_type that specifies
which console flushing methods should be used in the context of
the caller.

Introduce a helper function to fill out console_flush_type to
be used for flushing call sites.

Replace the logic of all flushing call sites to use the new
helper.

This change standardizes behavior, leading to both fixes and
optimizations across various call sites. For instance, in
console_cpu_notify(), the new logic ensures that nbcon consoles
are flushed when they aren’t managed by the legacy loop.
Similarly, in console_flush_on_panic(), the system no longer
needs to flush nbcon consoles if none are present.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-31-john.ogness@linutronix.de
[pmladek@suse.com: Updated the commit message.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:49 +02:00
John Ogness
e35a888427 printk: Coordinate direct printing in panic
If legacy and nbcon consoles are registered and the nbcon
consoles are allowed to flush (i.e. no boot consoles
registered), the legacy consoles will no longer perform
direct printing on the panic CPU until after the backtrace
has been stored. This will give the safe nbcon consoles a
chance to print the panic messages before allowing the
unsafe legacy consoles to print.

If no nbcon consoles are registered or they are not allowed
to flush because boot consoles are registered, there is no
change in behavior (i.e. legacy consoles will always attempt
to print from the printk() caller context).

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-30-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:25 +02:00
John Ogness
bebd87ae27 printk: Track nbcon consoles
Add a global flag @have_nbcon_console to identify if any nbcon
consoles are registered. This will be used in follow-up commits
to preserve legacy behavior when no nbcon consoles are registered.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-29-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:25 +02:00
John Ogness
60013065fd printk: Avoid console_lock dance if no legacy or boot consoles
Currently the console lock is used to attempt legacy-type
printing even if there are no legacy or boot consoles registered.
If no such consoles are registered, the console lock does not
need to be taken.

Add tracking of legacy console registration and use it with
boot console tracking to avoid unnecessary code paths, i.e.
do not use the console lock if there are no boot consoles
and no legacy consoles.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-28-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
5dde3b7354 printk: nbcon: Add unsafe flushing on panic
Add nbcon_atomic_flush_unsafe() to flush all nbcon consoles
using the write_atomic() callback and allowing unsafe hostile
takeovers. Call this at the end of panic() as a final attempt
to flush any pending messages.

Note that legacy consoles use unsafe methods for flushing
from the beginning of panic (see bust_spinlocks()). Therefore,
systems using both legacy and nbcon consoles may still fail to
see panic messages due to unsafe legacy console usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-27-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
d2e85ca7a7 printk: Flush nbcon consoles first on panic
In console_flush_on_panic(), flush the nbcon consoles before
flushing legacy consoles. The legacy write() callbacks are not
fully safe when oops_in_progress is set.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-26-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
8ba77712a7 printk: nbcon: Flush new records on device_release()
There may be new records that were added while a driver was
holding the nbcon context for non-printing purposes. These
new records must be flushed by the nbcon_device_release()
context because no other context will do it.

If boot consoles are registered, the legacy loop is used
(either direct or per irq_work) to handle the flushing.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-25-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
70411bf8d2 printk: Add is_printk_legacy_deferred()
If printk has been explicitly deferred or is called from NMI
context, legacy console printing must be deferred to an irq_work
context. Introduce a helper function is_printk_legacy_deferred()
for a CPU to query if it must defer legacy console printing.

In follow-up commits this helper will be needed at other call
sites as well.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-24-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
c158834b22 printk: nbcon: Use nbcon consoles in console_flush_all()
Allow nbcon consoles to print messages in the legacy printk()
caller context (printing via unlock) by integrating them into
console_flush_all(). The write_atomic() callback is used for
printing.

Provide nbcon_legacy_emit_next_record(), which acts as the
nbcon variant of console_emit_next_record(). Call this variant
within console_flush_all() for nbcon consoles. Since nbcon
consoles use their own @nbcon_seq variable to track the next
record to print, this also must be appropriately handled in
console_flush_all().

Note that the legacy printing logic uses @handover to detect
handovers for printing all consoles. For nbcon consoles,
handovers/takeovers occur on a per-console basis and thus do
not cause the console_flush_all() loop to abort.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-23-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
97ea9bccfc printk: Track registered boot consoles
Unfortunately it is not known if a boot console and a regular
(legacy or nbcon) console use the same hardware. For this reason
they must not be allowed to print simultaneously.

For legacy consoles this is not an issue because they are
already synchronized with the boot consoles using the console
lock. However nbcon consoles can be triggered separately.

Add a global flag @have_boot_console to identify if any boot
consoles are registered. This will be used in follow-up commits
to ensure that boot consoles and nbcon consoles cannot print
simultaneously.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-22-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
Thomas Gleixner
d3a9f82ec5 printk: nbcon: Provide function to flush using write_atomic()
Provide nbcon_atomic_flush_pending() to perform flushing of all
registered nbcon consoles using their write_atomic() callback.

Unlike console_flush_all(), nbcon_atomic_flush_pending() will
only flush up through the newest record at the time of the
call. This prevents a CPU from printing unbounded when other
CPUs are adding records. If new records are added while
flushing, it is expected that the dedicated printer threads
will print those records. If the printer thread is not
available (which is always the case at this point in the
rework), nbcon_atomic_flush_pending() _will_ flush all records
in the ringbuffer.

Unlike console_flush_all(), nbcon_atomic_flush_pending() will
fully flush one console before flushing the next. This helps to
guarantee that a block of pending records (such as a stack
trace in an emergency situation) can be printed atomically at
once before releasing console ownership.

nbcon_atomic_flush_pending() is safe in any context because it
uses write_atomic() and acquires with unsafe_takeover disabled.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-21-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
06683a6649 printk: nbcon: Add helper to assign priority based on CPU state
Add a helper function to use the current state of the CPU to
determine which priority to assign to the printing context.

The EMERGENCY priority handling is added in a follow-up commit.
It will use a per-CPU variable.

Note: nbcon_device_try_acquire(), which is used by console
      drivers to acquire the nbcon console for non-printing
      activities, is hard-coded to always use NORMAL priority.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-20-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
fc400d5f63 printk: Add @flags argument for console_is_usable()
The caller of console_is_usable() usually needs @console->flags
for its own checks. Rather than having console_is_usable() read
its own copy, make the caller pass in the @flags. This also
ensures that the caller saw the same @flags value.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-19-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
20846d1ce2 printk: Let console_is_usable() handle nbcon
The nbcon consoles use a different printing callback. For nbcon
consoles, check for the write_atomic() callback instead of
write().

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-18-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:24 +02:00
John Ogness
864c25c83d printk: Make console_is_usable() available to nbcon.c
Move console_is_usable() as-is into internal.h so that it can
be used by nbcon printing functions as well.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-17-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:23 +02:00
John Ogness
1c17ebb790 printk: nbcon: Do not rely on proxy headers
The headers kernel.h, serial_core.h, and console.h allow for the
definitions of many types and functions from other headers.
Rather than relying on these as proxy headers, explicitly
include all headers providing needed definitions. Also sort the
list alphabetically to be able to easily detect duplicates.

Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-16-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:23 +02:00
John Ogness
adf6f37d14 nbcon: Add API to acquire context for non-printing operations
Provide functions nbcon_device_try_acquire() and
nbcon_device_release() which will try to acquire the nbcon
console ownership with NBCON_PRIO_NORMAL and mark it unsafe for
handover/takeover.

These functions are to be used together with the device-specific
locking when performing non-printing activities on the console
device. They will allow synchronization against the
atomic_write() callback which will be serialized, for higher
priority contexts, only by acquiring the console context
ownership.

Pitfalls:

The API requires to be called in a context with migration
disabled because it uses per-CPU variables internally.

The context is set unsafe for a takeover all the time. It
guarantees full serialization against any atomic_write() caller
except for the final flush in panic() which might try an unsafe
takeover.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-14-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:23 +02:00
John Ogness
e55c3bcf38 printk: nbcon: Use driver synchronization while (un)registering
Console drivers typically have to deal with access to the
hardware via user input/output (such as an interactive login
shell) and output of kernel messages via printk() calls.

They use some classic driver-specific locking mechanism in most
situations. But console->write_atomic() callbacks, used by nbcon
consoles, are synchronized only by acquiring the console
context.

The synchronization via the console context ownership is possible
only when the console driver is registered. It is when a
particular device driver is connected with a particular console
driver.

The two synchronization mechanisms must be synchronized between
each other. It is tricky because the console context ownership
is quite special. It might be taken over by a higher priority
context. Also CPU migration must be disabled. The most tricky
part is to (dis)connect these two mechanisms during the console
(un)registration.

Use the driver-specific locking callbacks: device_lock(),
device_unlock(). They allow taking the device-specific lock
while the device is being (un)registered by the related console
driver.

For example, these callbacks lock/unlock the port lock for
serial port drivers.

Note that the driver-specific locking is only needed during
(un)register if it is an nbcon console with the write_atomic()
callback implemented. If write_atomic() is not implemented, the
driver should never attempt to access the hardware without
first acquiring its driver-specific lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:23 +02:00
John Ogness
b7049d88c1 printk: nbcon: Remove return value for write_atomic()
The return value of write_atomic() does not provide any useful
information. On the contrary, it makes things more complicated
for the caller to appropriately deal with the information.

Change write_atomic() to not have a return value. If the
message did not get printed due to loss of ownership, the
caller will notice this on its own. If ownership was not lost,
it will be assumed that the driver successfully printed the
message and the sequence number for that console will be
incremented.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-7-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:23 +02:00
John Ogness
8c9dab2c55 printk: nbcon: Clarify rules of the owner/waiter matching
The functions nbcon_owner_matches() and nbcon_waiter_matches()
use a minimal set of data to determine if a context matches.
The existing kerneldoc and comments were not clear enough and
caused the printk folks to re-prove that the functions are
indeed reliable in all cases.

Update and expand the explanations so that it is clear that the
implementations are sufficient for all cases.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-6-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:22 +02:00
Sebastian Andrzej Siewior
0e1d5731d3 printk: Check printk_deferred_enter()/_exit() usage
Add validation that printk_deferred_enter()/_exit() are called in
non-migration contexts.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:22 +02:00
Petr Mladek
d3ff380d47 printk: Properly deal with nbcon consoles on seq init
If a non-boot console is registering and boot consoles exist,
the consoles are flushed before being unregistered. This allows
the non-boot console to continue where the boot console left
off.

If for whatever reason flushing fails, the lowest seq found from
any of the enabled boot consoles is used. Until now con->seq was
checked. However, if it is an nbcon boot console, the function
nbcon_seq_read() must be used to read seq because con->seq is
not updated for nbcon consoles.

Check if it is an nbcon boot console and if so call
nbcon_seq_read() to read seq.

Also, avoid usage of con->seq as temporary storage of the
starting record. Instead, rename console_init_seq() to
get_init_console_seq() and just return the value. For nbcon
consoles set the sequence via nbcon_seq_force(), for legacy
consoles set con->seq.

The cleaned design should make sure that the value stays and is
set before the console is added to the console list. It also
unifies the sequence number initialization for legacy and nbcon
consoles.

Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20240820063001.36405-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:22 +02:00
John Ogness
f37b105fae printk: nbcon: Consolidate alloc() and init()
Rather than splitting the nbcon allocation and initialization into
two pieces, perform all initialization in nbcon_alloc(). Later,
the initial sequence is calculated and can be explicitly set using
nbcon_seq_force(). This removes the need for the strong rules of
nbcon_init() that even included a BUG_ON().

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:22 +02:00
John Ogness
eda25860bf printk: Add notation to console_srcu locking
kernel/printk/printk.c:284:5: sparse: sparse: context imbalance in
'console_srcu_read_lock' - wrong count at exit
include/linux/srcu.h:301:9: sparse: sparse: context imbalance in
'console_srcu_read_unlock' - unexpected unlock

Fixes: 6c4afa7914 ("printk: Prepare for SRCU console list protection")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240820063001.36405-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21 14:56:22 +02:00
Linus Torvalds
b0da640826 Merge tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:

 - Do not block printk on non-panic CPUs when they are dumping
   backtraces

* tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
2024-08-19 09:26:35 -07:00
Linus Torvalds
c3f2d783a4 Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "16 hotfixes. All except one are for MM. 10 of these are cc:stable and
  the others pertain to post-6.10 issues.

  As usual with these merges, singletons and doubletons all over the
  place, no identifiable-by-me theme. Please see the lovingly curated
  changelogs to get the skinny"

* tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/migrate: fix deadlock in migrate_pages_batch() on large folios
  alloc_tag: mark pages reserved during CMA activation as not tagged
  alloc_tag: introduce clear_page_tag_ref() helper function
  crash: fix riscv64 crash memory reserve dead loop
  selftests: memfd_secret: don't build memfd_secret test on unsupported arches
  mm: fix endless reclaim on machines with unaccepted memory
  selftests/mm: compaction_test: fix off by one in check_compaction()
  mm/numa: no task_numa_fault() call if PMD is changed
  mm/numa: no task_numa_fault() call if PTE is changed
  mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
  mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
  mm: don't account memmap per-node
  mm: add system wide stats items category
  mm: don't account memmap on failure
  mm/hugetlb: fix hugetlb vs. core-mm PT locking
  mseal: fix is_madv_discard()
2024-08-17 19:50:16 -07:00
Linus Torvalds
810996a363 Merge tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - Fix crashes on 85xx with some configs since the recent hugepd rework.

 - Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some
   platforms.

 - Don't enable offline cores when changing SMT modes, to match existing
   userspace behaviour.

Thanks to Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal
Jan K.A, Shrikanth Hegde, Thomas Gleixner, and Tyrel Datwyler.

* tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/topology: Check if a core is online
  cpu/SMT: Enable SMT only if a core is online
  powerpc/mm: Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL
  powerpc/mm: Fix size of allocated PGDIR
  soc: fsl: qbman: remove unused struct 'cgr_comp'
2024-08-17 19:23:02 -07:00
Linus Torvalds
4a621e2910 Merge tag 'trace-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
 "A couple of fixes for tracing:

   - Prevent a NULL pointer dereference in the error path of RTLA tool

   - Fix an infinite loop bug when reading from the ring buffer when
     closed. If there's a thread trying to read the ring buffer and it
     gets closed by another thread, the one reading will go into an
     infinite loop when the buffer is empty instead of exiting back to
     user space"

* tag 'trace-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rtla/osnoise: Prevent NULL dereference in error handling
  tracing: Return from tracing_buffers_read() if the file has been closed
2024-08-16 11:12:29 -07:00
Jinjie Ruan
edb907a613 crash: fix riscv64 crash memory reserve dead loop
On RISCV64 Qemu machine with 512MB memory, cmdline "crashkernel=500M,high"
will cause system stall as below:

	 Zone ranges:
	   DMA32    [mem 0x0000000080000000-0x000000009fffffff]
	   Normal   empty
	 Movable zone start for each node
	 Early memory node ranges
	   node   0: [mem 0x0000000080000000-0x000000008005ffff]
	   node   0: [mem 0x0000000080060000-0x000000009fffffff]
	 Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
	(stall here)

commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop
bug") fix this on 32-bit architecture.  However, the problem is not
completely solved.  If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on
64-bit architecture, for example, when system memory is equal to
CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also
occur:

	-> reserve_crashkernel_generic() and high is true
	   -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
	      -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
	         (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).

As Catalin suggested, do not remove the ",high" reservation fallback to
",low" logic which will change arm64's kdump behavior, but fix it by
skipping the above situation similar to commit d2f32f23190b ("crash: fix
x86_32 crash memory reserve dead loop").

After this patch, it print:
	cannot allocate crashkernel (size:0x1f400000)

Link: https://lkml.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Dave Young <dyoung@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15 22:16:16 -07:00
Linus Torvalds
e724918b37 Merge tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:

 - gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement
   (Thorsten Blum)

 - kallsyms: Clean up interaction with LTO suffixes (Song Liu)

 - refcount: Report UAF for refcount_sub_and_test(0) when counter==0
   (Petr Pavlu)

 - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov)

* tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
  kallsyms: Do not cleanup .llvm.<hash> suffix before sorting symbols
  kunit/overflow: Fix UB in overflow_allocation_test
  gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement
  refcount: Report UAF for refcount_sub_and_test(0) when counter==0
2024-08-15 11:50:07 -07:00
Song Liu
fb6a421fb6 kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
With CONFIG_LTO_CLANG=y, the compiler may add .llvm.<hash> suffix to
function names to avoid duplication. APIs like kallsyms_lookup_name()
and kallsyms_on_each_match_symbol() tries to match these symbol names
without the .llvm.<hash> suffix, e.g., match "c_stop" with symbol
c_stop.llvm.17132674095431275852. This turned out to be problematic
for use cases that require exact match, for example, livepatch.

Fix this by making the APIs to match symbols exactly.

Also cleanup kallsyms_selftests accordingly.

Signed-off-by: Song Liu <song@kernel.org>
Fixes: 8cc32a9bbf ("kallsyms: strip LTO-only suffixes from promoted global functions")
Tested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240807220513.3100483-3-song@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-15 09:33:35 -07:00
Linus Torvalds
4ac0f08f44 Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
 "VFS:

   - Fix the name of file lease slab cache. When file leases were split
     out of file locks the name of the file lock slab cache was used for
     the file leases slab cache as well.

   - Fix a type in take_fd() helper.

   - Fix infinite directory iteration for stable offsets in tmpfs.

   - When the icache is pruned all reclaimable inodes are marked with
     I_FREEING and other processes that try to lookup such inodes will
     block.

     But some filesystems like ext4 can trigger lookups in their inode
     evict callback causing deadlocks. Ext4 does such lookups if the
     ea_inode feature is used whereby a separate inode may be used to
     store xattrs.

     Introduce I_LRU_ISOLATING which pins the inode while its pages are
     reclaimed. This avoids inode deletion during inode_lru_isolate()
     avoiding the deadlock and evict is made to wait until
     I_LRU_ISOLATING is done.

  netfs:

   - Fault in smaller chunks for non-large folio mappings for
     filesystems that haven't been converted to large folios yet.

   - Fix the CONFIG_NETFS_DEBUG config option. The config option was
     renamed a short while ago and that introduced two minor issues.
     First, it depended on CONFIG_NETFS whereas it wants to depend on
     CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
     does. Second, the documentation for the config option wasn't fixed
     up.

   - Revert the removal of the PG_private_2 writeback flag as ceph is
     using it and fix how that flag is handled in netfs.

   - Fix DIO reads on 9p. A program watching a file on a 9p mount
     wouldn't see any changes in the size of the file being exported by
     the server if the file was changed directly in the source
     filesystem. Fix this by attempting to read the full size specified
     when a DIO read is requested.

   - Fix a NULL pointer dereference bug due to a data race where a
     cachefiles cookies was retired even though it was still in use.
     Check the cookie's n_accesses counter before discarding it.

  nsfs:

   - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
     the kernel is writing to userspace.

  pidfs:

   - Prevent the creation of pidfds for kthreads until we have a
     use-case for it and we know the semantics we want. It also confuses
     userspace why they can get pidfds for kthreads.

  squashfs:

   - Fix an unitialized value bug reported by KMSAN caused by a
     corrupted symbolic link size read from disk. Check that the
     symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 09:06:28 -07:00
Kyle Huey
100bff2381 perf/bpf: Don't call bpf_overflow_handler() for tracing events
The regressing commit is new in 6.10. It assumed that anytime event->prog
is set bpf_overflow_handler() should be invoked to execute the attached bpf
program. This assumption is false for tracing events, and as a result the
regressing commit broke bpftrace by invoking the bpf handler with garbage
inputs on overflow.

Prior to the regression the overflow handlers formed a chain (of length 0,
1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added
bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog()
(the tracing case) did not. Both set event->prog. The chain of overflow
handlers was replaced by a single overflow handler slot and a fixed call to
bpf_overflow_handler() when appropriate. This modifies the condition there
to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the
previous behavior and fixing bpftrace.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Joe Damato <jdamato@fastly.com>
Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/
Fixes: f11f10bfa1 ("perf/bpf: Call BPF handler directly, not through overflow machinery")
Cc: stable@vger.kernel.org
Tested-by: Joe Damato <jdamato@fastly.com> # bpftrace
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240813151727.28797-1-jdamato@fastly.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-13 10:25:28 -07:00
Ryo Takakura
bcc954c6ca printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
commit 779dbc2e78 ("printk: Avoid non-panic CPUs writing
to ringbuffer") disabled non-panic CPUs to further write messages to
ringbuffer after panicked.

Since the commit, non-panicked CPU's are not allowed to write to
ring buffer after panicked and CPU backtrace which is triggered
after panicked to sample non-panicked CPUs' backtrace no longer
serves its function as it has nothing to print.

Fix the issue by allowing non-panicked CPUs to write into ringbuffer
while CPU backtrace is in flight.

Fixes: 779dbc2e78 ("printk: Avoid non-panic CPUs writing to ringbuffer")
Signed-off-by: Ryo Takakura <takakura@valinux.co.jp>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240812072703.339690-1-takakura@valinux.co.jp
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-13 14:16:22 +02:00
Yonghong Song
bed2eb964c bpf: Fix a kernel verifier crash in stacksafe()
Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
Further investigation shows that the crash is due to invalid memory access
in stacksafe(). More specifically, it is the following code:

    if (exact != NOT_EXACT &&
        old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
        cur->stack[spi].slot_type[i % BPF_REG_SIZE])
            return false;

The 'i' iterates old->allocated_stack.
If cur->allocated_stack < old->allocated_stack the out-of-bound
access will happen.

To fix the issue add 'i >= cur->allocated_stack' check such that if
the condition is true, stacksafe() should fail. Otherwise,
cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is legal.

Fixes: 2793a8b015 ("bpf: exact states comparison for iterator convergence checks")
Cc: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Daniel Hodges <hodgesd@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240812214847.213612-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-12 18:09:48 -07:00
Nysal Jan K.A
6c17ea1f3e cpu/SMT: Enable SMT only if a core is online
If a core is offline then enabling SMT should not online CPUs of
this core. By enabling SMT, what is intended is either changing the SMT
value from "off" to "on" or setting the SMT level (threads per core) from a
lower to higher value.

On PowerPC the ppc64_cpu utility can be used, among other things, to
perform the following functions:

ppc64_cpu --cores-on                # Get the number of online cores
ppc64_cpu --cores-on=X              # Put exactly X cores online
ppc64_cpu --offline-cores=X[,Y,...] # Put specified cores offline
ppc64_cpu --smt={on|off|value}      # Enable, disable or change SMT level

If the user has decided to offline certain cores, enabling SMT should
not online CPUs in those cores. This patch fixes the issue and changes
the behaviour as described, by introducing an arch specific function
topology_is_core_online(). It is currently implemented only for PowerPC.

Fixes: 73c58e7e14 ("powerpc: Add HOTPLUG_SMT support")
Reported-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Closes: https://groups.google.com/g/powerpc-utils-devel/c/wrwVzAAnRlI/m/5KJSoqP4BAAJ
Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240731030126.956210-2-nysal@linux.ibm.com
2024-08-13 10:31:24 +10:00
Christian Brauner
3b5bbe798b pidfd: prevent creation of pidfds for kthreads
It's currently possible to create pidfds for kthreads but it is unclear
what that is supposed to mean. Until we have use-cases for it and we
figured out what behavior we want block the creation of pidfds for
kthreads.

Link: https://lore.kernel.org/r/20240731-gleis-mehreinnahmen-6bbadd128383@brauner
Fixes: 32fcb426ec ("pid: add pidfd_open()")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:26 +02:00
Linus Torvalds
7270e931b5 Merge tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull time keeping fixes from Thomas Gleixner:

 - Fix a couple of issues in the NTP code where user supplied values are
   neither sanity checked nor clamped to the operating range. This
   results in integer overflows and eventualy NTP getting out of sync.

   According to the history the sanity checks had been removed in favor
   of clamping the values, but the clamping never worked correctly under
   all circumstances. The NTP people asked to not bring the sanity
   checks back as it might break existing applications.

   Make the clamping work correctly and add it where it's missing

 - If adjtimex() sets the clock it has to trigger the hrtimer subsystem
   so it can adjust and if the clock was set into the future expire
   timers if needed. The caller should provide a bitmask to tell
   hrtimers which clocks have been adjusted.

   adjtimex() uses not the proper constant and uses CLOCK_REALTIME
   instead, which is 0. So hrtimers adjusts only the clocks, but does
   not check for expired timers, which might make them expire really
   late. Use the proper bitmask constant instead.

* tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
  ntp: Safeguard against time_constant overflow
  ntp: Clamp maxerror and esterror to operating range
2024-08-11 10:15:34 -07:00
Linus Torvalds
56fe0a6a9f Merge tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "Three small fixes for interrupt core and drivers:

   - The interrupt core fails to honor caller supplied affinity hints
     for non-managed interrupts and uses the system default affinity on
     startup instead. Set the missing flag in the descriptor to tell the
     core to use the provided affinity.

   - Fix a shift out of bounds error in the Xilinx driver

   - Handle switching to level trigger correctly in the RISCV APLIC
     driver. It failed to retrigger the interrupt which causes it to
     become stale"

* tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration
  irqchip/xilinx: Fix shift out of bounds
  genirq/irqdesc: Honor caller provided affinity in alloc_desc()
2024-08-11 10:07:52 -07:00
Linus Torvalds
0409cc53c4 Merge tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:

 - avoid a deadlock with dma-debug and netconsole (Rik van Riel)

* tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping:
  dma-debug: avoid deadlock between dma debug vs printk and netconsole
2024-08-10 10:19:05 -07:00
Steven Rostedt
d0949cd44a tracing: Return from tracing_buffers_read() if the file has been closed
When running the following:

 # cd /sys/kernel/tracing/
 # echo 1 > events/sched/sched_waking/enable
 # echo 1 > events/sched/sched_switch/enable
 # echo 0 > tracing_on
 # dd if=per_cpu/cpu0/trace_pipe_raw of=/tmp/raw0.dat

The dd task would get stuck in an infinite loop in the kernel. What would
happen is the following:

When ring_buffer_read_page() returns -1 (no data) then a check is made to
see if the buffer is empty (as happens when the page is not full), it will
call wait_on_pipe() to wait until the ring buffer has data. When it is it
will try again to read data (unless O_NONBLOCK is set).

The issue happens when there's a reader and the file descriptor is closed.
The wait_on_pipe() will return when that is the case. But this loop will
continue to try again and wait_on_pipe() will again return immediately and
the loop will continue and never stop.

Simply check if the file was closed before looping and exit out if it is.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240808235730.78bf63e5@rorschach.local.home
Fixes: 2aa043a55b ("tracing/ring-buffer: Fix wait_on_pipe() race")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-09 12:59:35 -04:00
Linus Torvalds
146430a0c2 Merge tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull kprobe fixes from Masami Hiramatsu:

 - Fix misusing str_has_prefix() parameter order to check symbol prefix
   correctly

 - bpf: remove unused declaring of bpf_kprobe_override

* tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Fix to check symbol prefixes correctly
  bpf: kprobe: remove unused declaring of bpf_kprobe_override
2024-08-09 09:43:46 -07:00
Linus Torvalds
2124d84db2 module: make waiting for a concurrent module loader interruptible
The recursive aes-arm-bs module load situation reported by Russell King
is getting fixed in the crypto layer, but this in the meantime fixes the
"recursive load hangs forever" by just making the waiting for the first
module load be interruptible.

This should now match the old behavior before commit 9b9879fc03
("modules: catch concurrent module loads, treat them as idempotent"),
which used the different "wait for module to be ready" code in
module_patient_check_exists().

End result: a recursive module load will still block, but now a signal
will interrupt it and fail the second module load, at which point the
first module will successfully complete loading.

Fixes: 9b9879fc03 ("modules: catch concurrent module loads, treat them as idempotent")
Cc: Russell King <linux@armlinux.org.uk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-09 08:33:28 -07:00
Linus Torvalds
9466b6ae6b Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Have reading of event format files test if the metadata still exists.

   When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata
   is set to state that it is to prevent any new references to it from
   happening while waiting for existing references to close. When the
   last reference closes, the metadata is freed. But the "format" was
   missing a check to this flag (along with some other files) that
   allowed new references to happen, and a use-after-free bug to occur.

 - Have the trace event meta data use the refcount infrastructure
   instead of relying on its own atomic counters.

 - Have tracefs inodes use alloc_inode_sb() for allocation instead of
   using kmem_cache_alloc() directly.

 - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
   callers expect a real object or an ERR_PTR.

 - Have release_ei() use call_srcu() and not call_rcu() as all the
   protection is on SRCU and not RCU.

 - Fix ftrace_graph_ret_addr() to use the task passed in and not
   current.

 - Fix overflow bug in get_free_elt() where the counter can overflow the
   integer and cause an infinite loop.

 - Remove unused function ring_buffer_nr_pages()

 - Have tracefs freeing use the inode RCU infrastructure instead of
   creating its own.

   When the kernel had randomize structure fields enabled, the rcu field
   of the tracefs_inode was overlapping the rcu field of the inode
   structure, and corrupting it. Instead, use the destroy_inode()
   callback to do the initial cleanup of the code, and then have
   free_inode() free it.

* tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracefs: Use generic inode RCU for synchronizing freeing
  ring-buffer: Remove unused function ring_buffer_nr_pages()
  tracing: Fix overflow in get_free_elt()
  function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
  eventfs: Use SRCU for freeing eventfs_inodes
  eventfs: Don't return NULL in eventfs_create_dir()
  tracefs: Fix inode allocation
  tracing: Use refcount for trace_event_file reference counter
  tracing: Have format file honor EVENT_FILE_FL_FREED
2024-08-08 13:32:59 -07:00
Linus Torvalds
b3f5620f76 Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
 "Assorted little stuff:

   - lockdep fixup for lockdep_set_notrack_class()

   - we can now remove a device when using erasure coding without
     deadlocking, though we still hit other issues

   - the 'allocator stuck' timeout is now configurable, and messages are
     ratelimited. The default timeout has been increased from 10 seconds
     to 30"

* tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
  bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
  bcachefs: Make allocator stuck timeout configurable, ratelimit messages
  bcachefs: Add missing path_traverse() to btree_iter_next_node()
  bcachefs: ec should not allocate from ro devs
  bcachefs: Improved allocator debugging for ec
  bcachefs: Add missing bch2_trans_begin() call
  bcachefs: Add a comment for bucket helper types
  bcachefs: Don't rely on implicit unsigned -> signed integer conversion
  lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
  bcachefs: Fix double free of ca->buckets_nouse
2024-08-08 13:27:31 -07:00
Linus Torvalds
cb5b81bc9a module: warn about excessively long module waits
Russell King reported that the arm cbc(aes) crypto module hangs when
loaded, and Herbert Xu bisected it to commit 9b9879fc03 ("modules:
catch concurrent module loads, treat them as idempotent"), and noted:

 "So what's happening here is that the first modprobe tries to load a
  fallback CBC implementation, in doing so it triggers a load of the
  exact same module due to module aliases.

  IOW we're loading aes-arm-bs which provides cbc(aes). However, this
  needs a fallback of cbc(aes) to operate, which is made out of the
  generic cbc module + any implementation of aes, or ecb(aes). The
  latter happens to also be provided by aes-arm-cb so that's why it
  tries to load the same module again"

So loading the aes-arm-bs module ends up wanting to recursively load
itself, and the recursive load then ends up waiting for the original
module load to complete.

This is a regression, in that it used to be that we just tried to load
the module multiple times, and then as we went on to install it the
second time we would instead just error out because the module name
already existed.

That is actually also exactly what the original "catch concurrent loads"
patch did in commit 9828ed3f69 ("module: error out early on concurrent
load of the same module file"), but it turns out that it ends up being
racy, in that erroring out before the module has been fully initialized
will cause failures in dependent module loading.

See commit ac2263b588 (which was the revert of that "error out early")
commit for details about why erroring out before the module has been
initialized is actually fundamentally racy.

Now, for the actual recursive module load (as opposed to just
concurrently loading the same module twice), the race is not an issue.

At the same time it's hard for the kernel to see that this is recursion,
because the module load is always done from a usermode helper, so the
recursion is not some simple callchain within the kernel.

End result: this is not the real fix, but this at least adds a warning
for the situation (admittedly much too late for all the debugging pain
that Russell and Herbert went through) and if we can come to a
resolution on how to detect the recursion properly, this re-organizes
the code to make that easier.

Link: https://lore.kernel.org/all/ZrFHLqvFqhzykuYw@shell.armlinux.org.uk/
Reported-by: Russell King <linux@armlinux.org.uk>
Debugged-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-08 12:29:40 -07:00
Linus Torvalds
660e4b18a7 Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "Nine hotfixes. Five are cc:stable, the others either pertain to
  post-6.10 material or aren't considered necessary for earlier kernels.

  Five are MM and four are non-MM. No identifiable theme here - please
  see the individual changelogs"

* tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  padata: Fix possible divide-by-0 panic in padata_mt_helper()
  mailmap: update entry for David Heidelberg
  memcg: protect concurrent access to mem_cgroup_idr
  mm: shmem: fix incorrect aligned index when checking conflicts
  mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem
  mm: list_lru: fix UAF for memory cgroup
  kcov: properly check for softirq context
  MAINTAINERS: Update LTP members and web
  selftests: mm: add s390 to ARCH check
2024-08-08 07:32:20 -07:00
Waiman Long
6d45e1c948 padata: Fix possible divide-by-0 panic in padata_mt_helper()
We are hit with a not easily reproducible divide-by-0 panic in padata.c at
bootup time.

  [   10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI
  [   10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
  [   10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
  [   10.017908] Workqueue: events_unbound padata_mt_helper
  [   10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
    :
  [   10.017963] Call Trace:
  [   10.017968]  <TASK>
  [   10.018004]  ? padata_mt_helper+0x39/0xb0
  [   10.018084]  process_one_work+0x174/0x330
  [   10.018093]  worker_thread+0x266/0x3a0
  [   10.018111]  kthread+0xcf/0x100
  [   10.018124]  ret_from_fork+0x31/0x50
  [   10.018138]  ret_from_fork_asm+0x1a/0x30
  [   10.018147]  </TASK>

Looking at the padata_mt_helper() function, the only way a divide-by-0
panic can happen is when ps->chunk_size is 0.  The way that chunk_size is
initialized in padata_do_multithreaded(), chunk_size can be 0 when the
min_chunk in the passed-in padata_mt_job structure is 0.

Fix this divide-by-0 panic by making sure that chunk_size will be at least
1 no matter what the input parameters are.

Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
Fixes: 004ed42638 ("padata: add basic support for multithreaded jobs")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Waiman Long <longman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07 18:33:56 -07:00
Andrey Konovalov
7d4df2dad3 kcov: properly check for softirq context
When collecting coverage from softirqs, KCOV uses in_serving_softirq() to
check whether the code is running in the softirq context.  Unfortunately,
in_serving_softirq() is > 0 even when the code is running in the hardirq
or NMI context for hardirqs and NMIs that happened during a softirq.

As a result, if a softirq handler contains a remote coverage collection
section and a hardirq with another remote coverage collection section
happens during handling the softirq, KCOV incorrectly detects a nested
softirq coverate collection section and prints a WARNING, as reported by
syzbot.

This issue was exposed by commit a7f3813e58 ("usb: gadget: dummy_hcd:
Switch to hrtimer transfer scheduler"), which switched dummy_hcd to using
hrtimer and made the timer's callback be executed in the hardirq context.

Change the related checks in KCOV to account for this behavior of
in_serving_softirq() and make KCOV ignore remote coverage collection
sections in the hardirq and NMI contexts.

This prevents the WARNING printed by syzbot but does not fix the inability
of KCOV to collect coverage from the __usb_hcd_giveback_urb when dummy_hcd
is in use (caused by a7f3813e58); a separate patch is required for that.

Link: https://lkml.kernel.org/r/20240729022158.92059-1-andrey.konovalov@linux.dev
Fixes: 5ff3b30ab5 ("kcov: collect coverage from interrupts")
Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
Reported-by: syzbot+2388cdaeb6b10f0c13ac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2388cdaeb6b10f0c13ac
Acked-by: Marco Elver <elver@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marcello Sylvester Bauer <sylv@sylv.io>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07 18:33:56 -07:00