Commit Graph

45535 Commits

Author SHA1 Message Date
Linus Torvalds
9f0c253ddd Merge tag 'perf-core-2024-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:

 - Implement per-PMU context rescheduling to significantly improve
   single-PMU performance, and related cleanups/fixes (Peter Zijlstra
   and Namhyung Kim)

 - Fix ancient bug resulting in a lot of events being dropped
   erroneously at higher sampling frequencies (Luo Gengkun)

 - uprobes enhancements:

     - Implement RCU-protected hot path optimizations for better
       performance:

         "For baseline vs SRCU, peak througput increased from 3.7 M/s
          (million uprobe triggerings per second) up to about 8 M/s. For
          uretprobes it's a bit more modest with bump from 2.4 M/s to
          5 M/s.

          For SRCU vs RCU Tasks Trace, peak throughput for uprobes
          increases further from 8 M/s to 10.3 M/s (+28%!), and for
          uretprobes from 5.3 M/s to 5.8 M/s (+11%), as we have more
          work to do on uretprobes side.

          Even single-thread (no contention) performance is slightly
          better: 3.276 M/s to 3.396 M/s (+3.5%) for uprobes, and 2.055
          M/s to 2.174 M/s (+5.8%) for uretprobes."

          (Andrii Nakryiko et al)

     - Document mmap_lock, don't abuse get_user_pages_remote() (Oleg
       Nesterov)

     - Cleanups & fixes to prepare for future work:
        - Remove uprobe_register_refctr()
	- Simplify error handling for alloc_uprobe()
        - Make uprobe_register() return struct uprobe *
        - Fold __uprobe_unregister() into uprobe_unregister()
        - Shift put_uprobe() from delete_uprobe() to uprobe_unregister()
        - BPF: Fix use-after-free in bpf_uprobe_multi_link_attach()
          (Oleg Nesterov)

 - New feature & ABI extension: allow events to use PERF_SAMPLE READ
   with inheritance, enabling sample based profiling of a group of
   counters over a hierarchy of processes or threads (Ben Gainey)

 - Intel uncore & power events updates:

      - Add Arrow Lake and Lunar Lake support
      - Add PERF_EV_CAP_READ_SCOPE
      - Clean up and enhance cpumask and hotplug support
        (Kan Liang)

      - Add LNL uncore iMC freerunning support
      - Use D0:F0 as a default device
        (Zhenyu Wang)

 - Intel PT: fix AUX snapshot handling race (Adrian Hunter)

 - Misc fixes and cleanups (James Clark, Jiri Olsa, Oleg Nesterov and
   Peter Zijlstra)

* tag 'perf-core-2024-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  dmaengine: idxd: Clean up cpumask and hotplug for perfmon
  iommu/vt-d: Clean up cpumask and hotplug for perfmon
  perf/x86/intel/cstate: Clean up cpumask and hotplug
  perf: Add PERF_EV_CAP_READ_SCOPE
  perf: Generic hotplug support for a PMU with a scope
  uprobes: perform lockless SRCU-protected uprobes_tree lookup
  rbtree: provide rb_find_rcu() / rb_find_add_rcu()
  perf/uprobe: split uprobe_unregister()
  uprobes: travers uprobe's consumer list locklessly under SRCU protection
  uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
  uprobes: protected uprobe lifetime with SRCU
  uprobes: revamp uprobe refcounting and lifetime management
  bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()
  perf/core: Fix small negative period being ignored
  perf: Really fix event_function_call() locking
  perf: Optimize __pmu_ctx_sched_out()
  perf: Add context time freeze
  perf: Fix event_function_call() locking
  perf: Extract a few helpers
  perf: Optimize context reschedule for single PMU cases
  ...
2024-09-18 15:03:58 +02:00
Linus Torvalds
667495de21 Merge tag 'execve-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:

 - binfmt_elf: Dump smaller VMAs first in ELF cores (Brian Mak)

 - binfmt_elf: mseal address zero (Jeff Xu)

 - binfmt_elf, coredump: Log the reason of the failed core dumps (Roman
   Kisel)

* tag 'execve-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_elf: mseal address zero
  binfmt_elf: Dump smaller VMAs first in ELF cores
  binfmt_elf, coredump: Log the reason of the failed core dumps
  coredump: Standartize and fix logging
2024-09-18 11:53:31 +02:00
Linus Torvalds
bdf56c7580 Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
 "This time it's mostly refactoring and improving APIs for slab users in
  the kernel, along with some debugging improvements.

   - kmem_cache_create() refactoring (Christian Brauner)

     Over the years have been growing new parameters to
     kmem_cache_create() where most of them are needed only for a small
     number of caches - most recently the rcu_freeptr_offset parameter.

     To avoid adding new parameters to kmem_cache_create() and adjusting
     all its callers, or creating new wrappers such as
     kmem_cache_create_rcu(), we can now pass extra parameters using the
     new struct kmem_cache_args. Not explicitly initialized fields
     default to values interpreted as unused.

     kmem_cache_create() is for now a wrapper that works both with the
     new form: kmem_cache_create(name, object_size, args, flags) and the
     legacy form: kmem_cache_create(name, object_size, align, flags,
     ctor)

   - kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil
     Babka, Uladislau Rezki)

     Since SLOB removal, kfree() is allowed for freeing objects
     allocated by kmem_cache_create(). By extension kfree_rcu() as
     allowed as well, which can allow converting simple call_rcu()
     callbacks that only do kmem_cache_free(), as there was never a
     kmem_cache_free_rcu() variant. However, for caches that can be
     destroyed e.g. on module removal, the cache owners knew to issue
     rcu_barrier() first to wait for the pending call_rcu()'s, and this
     is not sufficient for pending kfree_rcu()'s due to its internal
     batching optimizations. Ulad has provided a new
     kvfree_rcu_barrier() and to make the usage less error-prone,
     kmem_cache_destroy() calls it. Additionally, destroying
     SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier()
     synchronously instead of using an async work, because the past
     motivation for async work no longer applies. Users of custom
     call_rcu() callbacks should however keep calling rcu_barrier()
     before cache destruction.

   - Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn)

     Currently, KASAN cannot catch UAFs in such caches as it is legal to
     access them within a grace period, and we only track the grace
     period when trying to free the underlying slab page. The new
     CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual
     object to be RCU-delayed, after which KASAN can poison them.

   - Delayed memcg charging (Shakeel Butt)

     In some cases, the memcg is uknown at allocation time, such as
     receiving network packets in softirq context. With
     kmem_cache_charge() these may be now charged later when the user
     and its memcg is known.

   - Misc fixes and improvements (Pedro Falcato, Axel Rasmussen,
     Christoph Lameter, Yan Zhen, Peng Fan, Xavier)"

* tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
  mm, slab: restore kerneldoc for kmem_cache_create()
  io_uring: port to struct kmem_cache_args
  slab: make __kmem_cache_create() static inline
  slab: make kmem_cache_create_usercopy() static inline
  slab: remove kmem_cache_create_rcu()
  file: port to struct kmem_cache_args
  slab: create kmem_cache_create() compatibility layer
  slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args
  slab: port KMEM_CACHE() to struct kmem_cache_args
  slab: remove rcu_freeptr_offset from struct kmem_cache
  slab: pass struct kmem_cache_args to do_kmem_cache_create()
  slab: pull kmem_cache_open() into do_kmem_cache_create()
  slab: pass struct kmem_cache_args to create_cache()
  slab: port kmem_cache_create_usercopy() to struct kmem_cache_args
  slab: port kmem_cache_create_rcu() to struct kmem_cache_args
  slab: port kmem_cache_create() to struct kmem_cache_args
  slab: add struct kmem_cache_args
  slab: s/__kmem_cache_create/do_kmem_cache_create/g
  memcg: add charging of already allocated slab objects
  mm/slab: Optimize the code logic in find_mergeable()
  ...
2024-09-18 08:53:53 +02:00
Linus Torvalds
6d450d120f Merge tag 'misc.2024.09.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull core dump update from Paul McKenney:
 "Sleep at TASK_IDLE when waiting for application core dump

  This causes the coredump_task_exit() function to sleep at TASK_IDLE,
  thus preventing task-blocked splats in case of large core dumps to
  slow devices"

* tag 'misc.2024.09.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  exit: Sleep at TASK_IDLE when waiting for application core dump
2024-09-18 08:31:57 +02:00
Linus Torvalds
e651e0a473 Merge tag 'kcsan.2024.09.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull kcsan update from Paul McKenney:
 "Use min() to fix Coccinelle warning.

  Courtesy of Thorsten Blum"

* tag 'kcsan.2024.09.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  kcsan: Use min() to fix Coccinelle warning
2024-09-18 08:28:59 +02:00
Linus Torvalds
067610ebaa Merge tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Neeraj Upadhyay:
 "Context tracking:
   - rename context tracking state related symbols and remove references
     to "dynticks" in various context tracking state variables and
     related helpers
   - force context_tracking_enabled_this_cpu() to be inlined to avoid
     leaving a noinstr section

  CSD lock:
   - enhance CSD-lock diagnostic reports
   - add an API to provide an indication of ongoing CSD-lock stall

  nocb:
   - update and simplify RCU nocb code to handle (de-)offloading of
     callbacks only for offline CPUs
   - fix RT throttling hrtimer being armed from offline CPU

  rcutorture:
   - remove redundant rcu_torture_ops get_gp_completed fields
   - add SRCU ->same_gp_state and ->get_comp_state functions
   - add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU
     polled grace periods
   - add CFcommon.arch for arch-specific Kconfig options
   - print number of update types in rcu_torture_write_types()
   - add rcutree.nohz_full_patience_delay testing to the TREE07 scenario
   - add a stall_cpu_repeat module parameter to test repeated CPU stalls
   - add argument to limit number of CPUs a guest OS can use in
     torture.sh

  rcustall:
   - abbreviate RCU CPU stall warnings during CSD-lock stalls
   - Allow dump_cpu_task() to be called without disabling preemption
   - defer printing stall-warning backtrace when holding rcu_node lock

  srcu:
   - make SRCU gp seq wrap-around faster
   - add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and
     ->reschedule_count which are used in heuristics governing
     auto-expediting of normal SRCU grace periods and
     grace-period-state-machine delays
   - mark idle SRCU-barrier callbacks to help identify stuck
     SRCU-barrier callback

  rcu tasks:
   - remove RCU Tasks Rude asynchronous APIs as they are no longer used
   - stop testing RCU Tasks Rude asynchronous APIs
   - fix access to non-existent percpu regions
   - check processor-ID assumptions during chosen CPU calculation for
     callback enqueuing
   - update description of rtp->tasks_gp_seq grace-period sequence
     number
   - add rcu_barrier_cb_is_done() to identify whether a given
     rcu_barrier callback is stuck
   - mark idle Tasks-RCU-barrier callbacks
   - add *torture_stats_print() functions to print detailed diagnostics
     for Tasks-RCU variants
   - capture start time of rcu_barrier_tasks*() operation to help
     distinguish a hung barrier operation from a long series of barrier
     operations

  refscale:
   - add a TINY scenario to support tests of Tiny RCU and Tiny
     SRCU
   - optimize process_durations() operation

  rcuscale:
   - dump stacks of stalled rcu_scale_writer() instances and
     grace-period statistics when rcu_scale_writer() stalls
   - mark idle RCU-barrier callbacks to identify stuck RCU-barrier
     callbacks
   - print detailed grace-period and barrier diagnostics on
     rcu_scale_writer() hangs for Tasks-RCU variants
   - warn if async module parameter is specified for RCU implementations
     that do not have async primitives such as RCU Tasks Rude
   - make all writer tasks report upon hang
   - tolerate repeated GFP_KERNEL failure in rcu_scale_writer()
   - use special allocator for rcu_scale_writer()
   - NULL out top-level pointers to heap memory to avoid double-free
     bugs on modprobe failures
   - maintain per-task instead of per-CPU callbacks count to avoid any
     issues with migration of either tasks or callbacks
   - constify struct ref_scale_ops

  Fixes:
   - use system_unbound_wq for kfree_rcu work to avoid disturbing
     isolated CPUs

  Misc:
   - warn on unexpected rcu_state.srs_done_tail state
   - better define "atomic" for list_replace_rcu() and
     hlist_replace_rcu() routines
   - annotate struct kvfree_rcu_bulk_data with __counted_by()"

* tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (90 commits)
  rcu: Defer printing stall-warning backtrace when holding rcu_node lock
  rcu/nocb: Remove superfluous memory barrier after bypass enqueue
  rcu/nocb: Conditionally wake up rcuo if not already waiting on GP
  rcu/nocb: Fix RT throttling hrtimer armed from offline CPU
  rcu/nocb: Simplify (de-)offloading state machine
  context_tracking: Tag context_tracking_enabled_this_cpu() __always_inline
  context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching
  rcu: Update stray documentation references to rcu_dynticks_eqs_{enter, exit}()
  rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs()
  rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck()
  rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save()
  rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snap
  rcu: Rename struct rcu_data .dynticks_snap into .watching_snap
  rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs()
  rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since()
  rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs()
  rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online()
  context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into rcu_is_watching_curr_cpu()
  context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*()
  refscale: Constify struct ref_scale_ops
  ...
2024-09-18 07:52:24 +02:00
Linus Torvalds
85a77db95a Merge tag 'wq-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Nothing major:

   - workqueue.panic_on_stall boot param added

   - alloc_workqueue_lockdep_map() added (used by DRM)

   - Other cleanusp and doc updates"

* tag 'wq-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansion
  workqueue: Fix another htmldocs build warning
  workqueue: fix null-ptr-deref on __alloc_workqueue() error
  workqueue: Don't call va_start / va_end twice
  workqueue: Fix htmldocs build warning
  workqueue: Add interface for user-defined workqueue lockdep map
  workqueue: Change workqueue lockdep map to pointer
  workqueue: Split alloc_workqueue into internal function and lockdep init
  Documentation: kernel-parameters: add workqueue.panic_on_stall
  workqueue: add cmdline parameter workqueue.panic_on_stall
2024-09-18 06:59:44 +02:00
Linus Torvalds
78567e2bc7 Merge tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - cpuset isolation improvements

 - cpuset cgroup1 support is split into its own file behind the new
   config option CONFIG_CPUSET_V1. This makes it the second controller
   which makes cgroup1 support optional after memcg

 - Handling of unavailable v1 controller handling improved during
   cgroup1 mount operations

 - union_find applied to cpuset. It makes code simpler and more
   efficient

 - Reduce spurious events in pids.events

 - Cleanups and other misc changes

 - Contains a merge of cgroup/for-6.11-fixes to receive cpuset fixes
   that further changes build upon

* tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (34 commits)
  cgroup: Do not report unavailable v1 controllers in /proc/cgroups
  cgroup: Disallow mounting v1 hierarchies without controller implementation
  cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
  cgroup/cpuset: Move cpu.h include to cpuset-internal.h
  cgroup/cpuset: add sefltest for cpuset v1
  cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1
  cgroup/cpuset: rename functions shared between v1 and v2
  cgroup/cpuset: move v1 interfaces to cpuset-v1.c
  cgroup/cpuset: move validate_change_legacy to cpuset-v1.c
  cgroup/cpuset: move legacy hotplug update to cpuset-v1.c
  cgroup/cpuset: add callback_lock helper
  cgroup/cpuset: move memory_spread to cpuset-v1.c
  cgroup/cpuset: move relax_domain_level to cpuset-v1.c
  cgroup/cpuset: move memory_pressure to cpuset-v1.c
  cgroup/cpuset: move common code to cpuset-internal.h
  cgroup/cpuset: introduce cpuset-v1.c
  selftest/cgroup: Make test_cpuset_prs.sh deal with pre-isolated CPUs
  cgroup/cpuset: Account for boot time isolated CPUs
  cgroup/cpuset: remove use_parent_ecpus of cpuset
  cgroup/cpuset: remove fetch_xcpus
  ...
2024-09-18 06:39:03 +02:00
Linus Torvalds
5ba202a7c9 Merge tag 'x86-build-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Thomas Gleixner:
 "Updates for KCOV instrumentation on x86:

   - Prevent spurious KCOV coverage in common_interrupt()

   - Fixup the KCOV Makefile directive which got stale due to a source
     file rename

   - Exclude stack unwinding from KCOV as it creates large amounts of
     uninteresting coverage

   - Provide a self test to validate that KCOV coverage of the interrupt
     handling code starts not before preempt count got updated"

* tag 'x86-build-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Ignore stack unwinding in KCOV
  module: Fix KCOV-ignored file name
  kcov: Add interrupt handling self test
  x86/entry: Remove unwanted instrumentation in common_interrupt()
2024-09-17 12:40:34 +02:00
Linus Torvalds
c903327d32 Merge tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
 "This is the "last" part of the support for the new nbcon consoles.
  Where "nbcon" stays for "No Big console lock CONsoles" aka not under
  the console_lock.

  New callbacks are added to struct console:

   - write_thread() for flushing nbcon consoles in task context.

   - write_atomic() for flushing nbcon consoles in atomic context,
     including NMI.

   - con->device_lock() and device_unlock() for taking the driver
     specific lock, for example, port->lock.

  New printk-specific kthreads are created:

   - per-console kthreads which get responsible for flushing normal
     priority messages on nbcon consoles.

   - thread which gets responsible for flushing normal priority messages
     on all consoles when CONFIG_RT enabled.

  The new callbacks are called under a special per-console lock which
  has already been added back in v6.7. It allows to distinguish three
  severities: normal, emergency, and panic. A context with a higher
  priority could take over the ownership when it is safe even in the
  middle of handling a record. The panic context could do it even when
  it is not safe. But it is allowed only for the final desperate flush
  before entering the infinite loop.

  The new lock helps to flush the messages directly in emergency and
  panic contexts. But it is not enough in all situations:

   - console_lock() is still need for synchronization against boot
     consoles.

   - con->device_lock() is need for synchronization against other
     operations on the same HW, e.g. serial port speed setting,
     non-printk related read/write.

  The dependency on con->device_lock() is mutual. Any code taking the
  driver specific lock has to acquire the related nbcon console context
  as well. For example, see the new uart_port_lock() API. It provides
  the necessary synchronization against emergency and panic contexts
  where the messages are flushed only under the new per-console lock.

  Maybe surprisingly, a quite tricky part is the decision how to flush
  the consoles in various situations. It has to take into account:

   - message priority:    normal, emergency, panic

   - scheduling context:  task, atomic, deferred_legacy

   - registered consoles: boot, legacy, nbcon

   - threads are running: early boot, suspend, shutdown, panic

   - caller:              printk(), pr_flush(), printk_flush_in_panic(),
                          console_unlock(), console_start(), ...

  The primary decision is made in printk_get_console_flush_type(). It
  creates a hint what the caller should do:

   - flush nbcon consoles directly or via the kthread

   - call the legacy loop (console_unlock()) directly or via irq_work

  The existing behavior is preserved for the legacy consoles. The only
  exception is that they are not longer flushed directly from printk()
  in panic() before CPUs are stopped. But this blocking happens only
  when at least one nbcon console is registered. The motivation is to
  increase a chance to produce the crash dump. They legacy consoles
  might create a deadlock in compare with nbcon consoles. The nbcon
  console should allow to see the messages even when the crash dump
  fails.

  There are three possible ways how nbcon consoles are flushed:

   - The per-nbcon-console kthread is responsible for flushing messages
     added with the normal priority. This is the default mode.

   - The legacy loop, aka console_unlock(), is used when there is still
     a boot console registered. There is no easy way how to match an
     early console driver with a nbcon console driver. And the
     console_lock() provides the only reliable serialization at the
     moment.

     The legacy loop uses either con->write_atomic() or
     con->write_thread() callbacks depending on whether it is allowed to
     schedule. The atomic variant has to be used from printk().

   - In other situations, the messages are flushed directly using
     write_atomic() which can be called in any context, including NMI.
     It is primary needed during early boot or shutdown, in emergency
     situations, and panic.

  The emergency priority is used by a code called within
  nbcon_cpu_emergency_enter()/exit(). At the moment, it is used in four
  situations: WARN(), Oops, lockdep, and RCU stall reports.

  Finally, there is no nbcon console at the moment. It means that the
  changes should _not_ modify the existing behavior. The only exception
  is CONFIG_RT which would force offloading the legacy loop, for normal
  priority context, into the dedicated kthread"

* tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (54 commits)
  printk: Avoid false positive lockdep report for legacy printing
  printk: nbcon: Assign nice -20 for printing threads
  printk: Implement legacy printer kthread for PREEMPT_RT
  tty: sysfs: Add nbcon support for 'active'
  proc: Add nbcon support for /proc/consoles
  proc: consoles: Add notation to c_start/c_stop
  printk: nbcon: Show replay message on takeover
  printk: Provide helper for message prepending
  printk: nbcon: Rely on kthreads for normal operation
  printk: nbcon: Use thread callback if in task context for legacy
  printk: nbcon: Relocate nbcon_atomic_emit_one()
  printk: nbcon: Introduce printer kthreads
  printk: nbcon: Init @nbcon_seq to highest possible
  printk: nbcon: Add context to usable() and emit()
  printk: Flush console on unregister_console()
  printk: Fail pr_flush() if before SYSTEM_SCHEDULING
  printk: nbcon: Add function for printers to reacquire ownership
  printk: nbcon: Use raw_cpu_ptr() instead of open coding
  printk: Use the BITS_PER_LONG macro
  lockdep: Mark emergency sections in lockdep splats
  ...
2024-09-17 08:52:28 +02:00
Linus Torvalds
9ea925c806 Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Core:

   - Overhaul of posix-timers in preparation of removing the workaround
     for periodic timers which have signal delivery ignored.

   - Remove the historical extra jiffie in msleep()

     msleep() adds an extra jiffie to the timeout value to ensure
     minimal sleep time. The timer wheel ensures minimal sleep time
     since the large rewrite to a non-cascading wheel, but the extra
     jiffie in msleep() remained unnoticed. Remove it.

   - Make the timer slack handling correct for realtime tasks.

     The procfs interface is inconsistent and does neither reflect
     reality nor conforms to the man page. Show the correct 0 slack for
     real time tasks and enforce it at the core level instead of having
     inconsistent individual checks in various timer setup functions.

   - The usual set of updates and enhancements all over the place.

  Drivers:

   - Allow the ACPI PM timer to be turned off during suspend

   - No new drivers

   - The usual updates and enhancements in various drivers"

* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  ntp: Make sure RTC is synchronized when time goes backwards
  treewide: Fix wrong singular form of jiffies in comments
  cpu: Use already existing usleep_range()
  timers: Rename next_expiry_recalc() to be unique
  platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
  clocksource/drivers/jcore: Use request_percpu_irq()
  clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
  clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
  clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
  clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
  platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
  clocksource: acpi_pm: Add external callback for suspend/resume
  clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
  dt-bindings: timer: rockchip: Add rk3576 compatible
  timers: Annotate possible non critical data race of next_expiry
  timers: Remove historical extra jiffie for timeout in msleep()
  hrtimer: Use and report correct timerslack values for realtime tasks
  hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
  timers: Add sparse annotation for timer_sync_wait_running().
  signal: Replace BUG_ON()s
  ...
2024-09-17 07:25:37 +02:00
Linus Torvalds
cb69d86550 Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "Core:

   - Remove a global lock in the affinity setting code

     The lock protects a cpumask for intermediate results and the lock
     causes a bottleneck on simultaneous start of multiple virtual
     machines. Replace the lock and the static cpumask with a per CPU
     cpumask which is nicely serialized by raw spinlock held when
     executing this code.

   - Provide support for giving a suffix to interrupt domain names.

     That's required to support devices with subfunctions so that the
     domain names are distinct even if they originate from the same
     device node.

   - The usual set of cleanups and enhancements all over the place

  Drivers:

   - Support for longarch AVEC interrupt chip

   - Refurbishment of the Armada driver so it can be extended for new
     variants.

   - The usual set of cleanups and enhancements all over the place"

* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
  genirq: Use cpumask_intersects()
  genirq/cpuhotplug: Use cpumask_intersects()
  irqchip/apple-aic: Only access system registers on SoCs which provide them
  irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
  irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
  dt-bindings: apple,aic: Document A7-A11 compatibles
  irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
  genirq/msi: Use kmemdup_array() instead of kmemdup()
  genirq/proc: Change the return value for set affinity permission error
  genirq/proc: Use irq_move_pending() in show_irq_affinity()
  genirq/proc: Correctly set file permissions for affinity control files
  genirq: Get rid of global lock in irq_do_set_affinity()
  genirq: Fix typo in struct comment
  irqchip/loongarch-avec: Add AVEC irqchip support
  irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
  irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
  LoongArch: Architectural preparation for AVEC irqchip
  LoongArch: Move irqchip function prototypes to irq-loongson.h
  irqchip/loongson-pch-msi: Switch to MSI parent domains
  softirq: Remove unused 'action' parameter from action callback
  ...
2024-09-17 07:09:17 +02:00
Linus Torvalds
a64405b78b Merge tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource watchdog updates from Thomas Gleixner:

 - Make the uncertainty margin handling more robust to prevent false
   positives

 - Clarify comments

* tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin
  clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW
  clocksource: Improve comments for watchdog skew bounds
2024-09-17 07:05:08 +02:00
Linus Torvalds
97e17c08a4 Merge tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:

 - Prepare the core for supporting parallel hotplug on loongarch

 - A small set of cleanups and enhancements

* tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Mark smp_prepare_boot_cpu() __init
  cpu: Fix W=1 build kernel-doc warning
  cpu/hotplug: Provide weak fallback for arch_cpuhp_init_parallel_bringup()
  cpu/hotplug: Make HOTPLUG_PARALLEL independent of HOTPLUG_SMT
2024-09-17 06:56:31 +02:00
Linus Torvalds
dc644fba3c Merge tag 'audit-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:

 - Fix some remaining problems with PID/TGID reporting

   When most users think about PIDs, what they are really thinking about
   is the TGID. This commit shifts the audit PID logging and filtering
   to use the TGID value which should provide a more meaningful audit
   stream and filtering experience for users.

 - Migrate to the str_enabled_disabled() helper

   Evidently we have helper functions that help ensure if we mistype
   "enabled" or "disabled" it is now caught at compile time. I guess
   we're fancy now.

* tag 'audit-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: Make use of str_enabled_disabled() helper
  audit: use task_tgid_nr() instead of task_pid_nr()
2024-09-16 16:52:37 +02:00
Linus Torvalds
8f72c31f45 Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "This contains the usual pile of misc updates:

  Features:

   - Add F_CREATED_QUERY fcntl() that allows userspace to query whether
     a file was actually created. Often userspace wants to know whether
     an O_CREATE request did actually create a file without using
     O_EXCL. The current logic is that to first attempts to open the
     file without O_CREAT | O_EXCL and if ENOENT is returned userspace
     tries again with both flags. If that succeeds all is well. If it
     now reports EEXIST it retries.

     That works fairly well but some corner cases make this more
     involved. If this operates on a dangling symlink the first openat()
     without O_CREAT | O_EXCL will return ENOENT but the second openat()
     with O_CREAT | O_EXCL will fail with EEXIST.

     The reason is that openat() without O_CREAT | O_EXCL follows the
     symlink while O_CREAT | O_EXCL doesn't for security reasons. So
     it's not something we can really change unless we add an explicit
     opt-in via O_FOLLOW which seems really ugly.

     All available workarounds are really nasty (fanotify, bpf lsm etc)
     so add a simple fcntl().

   - Try an opportunistic lookup for O_CREAT. Today, when opening a file
     we'll typically do a fast lookup, but if O_CREAT is set, the kernel
     always takes the exclusive inode lock. This was likely done with
     the expectation that O_CREAT means that we always expect to do the
     create, but that's often not the case. Many programs set O_CREAT
     even in scenarios where the file already exists (see related
     F_CREATED_QUERY patch motivation above).

     The series contained in the pr rearranges the pathwalk-for-open
     code to also attempt a fast_lookup in certain O_CREAT cases. If a
     positive dentry is found, the inode_lock can be avoided altogether
     and it can stay in rcuwalk mode for the last step_into.

   - Expose the 64 bit mount id via name_to_handle_at()

     Now that we provide a unique 64-bit mount ID interface in statx(2),
     we can now provide a race-free way for name_to_handle_at(2) to
     provide a file handle and corresponding mount without needing to
     worry about racing with /proc/mountinfo parsing or having to open a
     file just to do statx(2).

     While this is not necessary if you are using AT_EMPTY_PATH and
     don't care about an extra statx(2) call, users that pass full paths
     into name_to_handle_at(2) need to know which mount the file handle
     comes from (to make sure they don't try to open_by_handle_at a file
     handle from a different filesystem) and switching to AT_EMPTY_PATH
     would require allocating a file for every name_to_handle_at(2) call

   - Add a per dentry expire timeout to autofs

     There are two fairly well known automounter map formats, the autofs
     format and the amd format (more or less System V and Berkley).

     Some time ago Linux autofs added an amd map format parser that
     implemented a fair amount of the amd functionality. This was done
     within the autofs infrastructure and some functionality wasn't
     implemented because it either didn't make sense or required extra
     kernel changes. The idea was to restrict changes to be within the
     existing autofs functionality as much as possible and leave changes
     with a wider scope to be considered later.

     One of these changes is implementing the amd options:
      1) "unmount", expire this mount according to a timeout (same as
         the current autofs default).
      2) "nounmount", don't expire this mount (same as setting the
         autofs timeout to 0 except only for this specific mount) .
      3) "utimeout=<seconds>", expire this mount using the specified
         timeout (again same as setting the autofs timeout but only for
         this mount)

     To implement these options per-dentry expire timeouts need to be
     implemented for autofs indirect mounts. This is because all map
     keys (mounts) for autofs indirect mounts use an expire timeout
     stored in the autofs mount super block info. structure and all
     indirect mounts use the same expire timeout.

  Fixes:

   - Fix missing fput for FSCONFIG_SET_FD in autofs

   - Use param->file for FSCONFIG_SET_FD in coda

   - Delete the 'fs/netfs' proc subtreee when netfs module exits

   - Make sure that struct uid_gid_map fits into a single cacheline

   - Don't flush in-flight wb switches for superblocks without cgroup
     writeback

   - Correcting the idmapping mount example in the idmapping
     documentation

   - Fix a race between evice_inodes() and find_inode() and iput()

   - Refine the show_inode_state() macro definition in writeback code

   - Prevent dump_mapping() from accessing invalid dentry.d_name.name

   - Show actual source for debugfs in /proc/mounts

   - Annotate data-race of busy_poll_usecs in eventpoll

   - Don't WARN for racy path_noexec check in exec code

   - Handle OOM on mnt_warn_timestamp_expiry()

   - Fix some spelling in the iomap design documentation

   - Fix typo in procfs comment

   - Fix typo in fs/namespace.c comment

  Cleanups:

   - Add the VFS git tree to the MAINTAINERS file

   - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
     bit in struct file bringing us to 5 free f_mode bits

   - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
     the wait mechanism

   - Remove the unused path_put_init() helper

   - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
     specific

   - Replace the unsigned long i_state member with a u32 i_state member
     in struct inode freeing up 4 bytes in struct inode. Instead of
     using the bit based wait apis we're now using the var event apis
     and using the individual bytes of the i_state member to wait on
     state changes

   - Explain how per-syscall AT_* flags should be allocated

   - Use in_group_or_capable() helper to simplify the posix acl mode
     update code

   - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code

   - Removed comment about d_rcu_to_refcount() as that function doesn't
     exist anymore

   - Add kernel documentation for lookup_fast()

   - Don't re-zero evenpoll fields

   - Remove outdated comment after close_fd()

   - Fix imprecise wording in comment about the pipe filesystem

   - Drop GFP_NOFAIL mode from alloc_page_buffers

   - Missing blank line warnings and struct declaration improved in
     file_table

   - Annotate struct poll_list with __counted_by()

   - Remove the unused read parameter in percpu-rwsem

   - Remove linux/prefetch.h include from direct-io code

   - Use kmemdup_array instead of kmemdup for multiple allocation in
     mnt_idmapping code

   - Remove unused mnt_cursor_del() declaration

  Performance tweaks:

   - Dodge smp_mb in break_lease and break_deleg in the common case

   - Only read fops once in fops_{get,put}()

   - Use RCU in ilookup()

   - Elide smp_mb in iversion handling in the common case

   - Drop one lock trip in evict()"

* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
  uidgid: make sure we fit into one cacheline
  proc: Fix typo in the comment
  fs/pipe: Correct imprecise wording in comment
  fhandle: expose u64 mount id to name_to_handle_at(2)
  uapi: explain how per-syscall AT_* flags should be allocated
  fs: drop GFP_NOFAIL mode from alloc_page_buffers
  writeback: Refine the show_inode_state() macro definition
  fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
  mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
  netfs: Delete subtree of 'fs/netfs' when netfs module exits
  fs: use LIST_HEAD() to simplify code
  inode: make i_state a u32
  inode: port __I_LRU_ISOLATING to var event
  vfs: fix race between evice_inodes() and find_inode()&iput()
  inode: port __I_NEW to var event
  inode: port __I_SYNC to var event
  fs: reorder i_state bits
  fs: add i_state helpers
  MAINTAINERS: add the VFS git tree
  fs: s/__u32/u32/ for s_fsnotify_mask
  ...
2024-09-16 08:35:09 +02:00
Linus Torvalds
02824a5fd1 Merge tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "By the number of new lines of code, the most visible change here is
  the addition of hybrid CPU capacity scaling support to the
  intel_pstate driver. Next are the amd-pstate driver changes related to
  the calculation of the AMD boost numerator and preferred core
  detection.

  As far as new hardware support is concerned, the intel_idle driver
  will now handle Granite Rapids Xeon processors natively, the
  intel_rapl power capping driver will recognize family 1Ah of AMD
  processors and Intel ArrowLake-U chipos, and intel_pstate will handle
  Granite Rapids and Sierra Forest chips in the out-of-band (OOB) mode.

  Apart from the above, there is a usual collection of assorted fixes
  and code cleanups in many places and there are tooling updates.

  Specifics:

   - Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef)

   - Add support for Granite Rapids and Sierra Forest in OOB mode to the
     intel_pstate cpufreq driver (Srinivas Pandruvada)

   - Add basic support for CPU capacity scaling on x86 and make the
     intel_pstate driver set asymmetric CPU capacity on hybrid systems
     without SMT (Rafael Wysocki)

   - Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
     driver (Jeff Johnson)

   - Several OF related cleanups in cpufreq drivers (Rob Herring)

   - Enable COMPILE_TEST for ARM drivers (Rob Herrring)

   - Introduce quirks for syscon failures and use socinfo to get
     revision for TI cpufreq driver (Dhruva Gole, Nishanth Menon)

   - Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
     Ugwekar)

   - Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
     (Danila Tikhonov, Huacai Chen, and Liu Jing)

   - Make amd-pstate validate return of any attempt to update EPP
     limits, which fixes the masking hardware problems (Mario
     Limonciello)

   - Move the calculation of the AMD boost numerator outside of
     amd-pstate, correcting acpi-cpufreq on systems with preferred cores
     (Mario Limonciello)

   - Harden preferred core detection in amd-pstate to avoid potential
     false positives (Mario Limonciello)

   - Add extra unit test coverage for mode state machine (Mario
     Limonciello)

   - Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang
     Liu)

   - Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy)

   - Disable promotion to C1E on Jasper Lake and Elkhart Lake in
     intel_idle (Kai-Heng Feng)

   - Use scoped device node handling to fix missing of_node_put() and
     simplify walking OF children in the riscv-sbi cpuidle driver
     (Krzysztof Kozlowski)

   - Remove dead code from cpuidle_enter_state() (Dhruva Gole)

   - Change an error pointer to NULL to fix error handling in the
     intel_rapl power capping driver (Dan Carpenter)

   - Fix off by one in get_rpi() in the intel_rapl power capping driver
     (Dan Carpenter)

   - Add support for ArrowLake-U to the intel_rapl power capping driver
     (Sumeet Pawnikar)

   - Fix the energy-pkg event for AMD CPUs in the intel_rapl power
     capping driver (Dhananjay Ugwekar)

   - Add support for AMD family 1Ah processors to the intel_rapl power
     capping driver (Dhananjay Ugwekar)

   - Remove unused stub for saveable_highmem_page() and remove
     deprecated macros from power management documentation (Andy
     Shevchenko)

   - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM
     sysfs interface (Xueqin Luo)

   - Update the maintainers information for the
     operating-points-v2-ti-cpu DT binding (Dhruva Gole)

   - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring)

   - Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff
     Johnson)

   - Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand
     Moon)

   - Use of_property_present() instead of of_get_property() in the
     imx-bus devfreq driver (Rob Herring)

   - Update directory handling and installation process in the pm-graph
     Makefile and add .gitignore to ignore sleepgraph.py artifacts to
     pm-graph (Amit Vadhavana, Yo-Jung Lin)

   - Make cpupower display residency value in idle-info (Aboorva
     Devarajan)

   - Add missing powercap_set_enabled() stub function to cpupower (John
     B. Wyatt IV)

   - Add SWIG support to cpupower (John B. Wyatt IV)"

* tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (62 commits)
  cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
  cpufreq/amd-pstate-ut: Add test case for mode switches
  cpufreq/amd-pstate: Export symbols for changing modes
  amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
  cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
  cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
  cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
  x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
  x86/amd: Move amd_get_highest_perf() out of amd-pstate
  ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
  ACPI: CPPC: Drop check for non zero perf ratio
  x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
  ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
  x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
  PM: hibernate: Remove unused stub for saveable_highmem_page()
  pm:cpupower: Add error warning when SWIG is not installed
  MAINTAINERS: Add Maintainers for SWIG Python bindings
  pm:cpupower: Include test_raw_pylibcpupower.py
  pm:cpupower: Add SWIG bindings files for libcpupower
  pm:cpupower: Add missing powercap_set_enabled() stub function
  ...
2024-09-16 07:47:50 +02:00
Linus Torvalds
114143a595 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "The highlights are support for Arm's "Permission Overlay Extension"
  using memory protection keys, support for running as a protected guest
  on Android as well as perf support for a bunch of new interconnect
  PMUs.

  Summary:

  ACPI:
   - Enable PMCG erratum workaround for HiSilicon HIP10 and 11
     platforms.
   - Ensure arm64-specific IORT header is covered by MAINTAINERS.

  CPU Errata:
   - Enable workaround for hardware access/dirty issue on Ampere-1A
     cores.

  Memory management:
   - Define PHYSMEM_END to fix a crash in the amdgpu driver.
   - Avoid tripping over invalid kernel mappings on the kexec() path.
   - Userspace support for the Permission Overlay Extension (POE) using
     protection keys.

  Perf and PMUs:
   - Add support for the "fixed instruction counter" extension in the
     CPU PMU architecture.
   - Extend and fix the event encodings for Apple's M1 CPU PMU.
   - Allow LSM hooks to decide on SPE permissions for physical
     profiling.
   - Add support for the CMN S3 and NI-700 PMUs.

  Confidential Computing:
   - Add support for booting an arm64 kernel as a protected guest under
     Android's "Protected KVM" (pKVM) hypervisor.

  Selftests:
   - Fix vector length issues in the SVE/SME sigreturn tests
   - Fix build warning in the ptrace tests.

  Timers:
   - Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
     non-determinism arising from the architected counter.

  Miscellaneous:
   - Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
     don't succeed.
   - Minor fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  perf: arm-ni: Fix an NULL vs IS_ERR() bug
  arm64: hibernate: Fix warning for cast from restricted gfp_t
  arm64: esr: Define ESR_ELx_EC_* constants as UL
  arm64: pkeys: remove redundant WARN
  perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
  MAINTAINERS: List Arm interconnect PMUs as supported
  perf: Add driver for Arm NI-700 interconnect PMU
  dt-bindings/perf: Add Arm NI-700 PMU
  perf/arm-cmn: Improve format attr printing
  perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
  arm64/mm: use lm_alias() with addresses passed to memblock_free()
  mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
  arm64: Expose the end of the linear map in PHYSMEM_END
  arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
  arm64/mm: Delete __init region from memblock.reserved
  perf/arm-cmn: Support CMN S3
  dt-bindings: perf: arm-cmn: Add CMN S3
  perf/arm-cmn: Refactor DTC PMU register access
  perf/arm-cmn: Make cycle counts less surprising
  perf/arm-cmn: Improve build-time assertion
  ...
2024-09-16 06:55:07 +02:00
Linus Torvalds
85ffc6e4ed Merge tag 'v6.12-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu"
 "API:
   - Make self-test asynchronous

  Algorithms:
   - Remove MPI functions added for SM3
   - Add allocation error checks to remaining MPI functions (introduced
     for SM3)
   - Set default Jitter RNG OSR to 3

  Drivers:
   - Add hwrng driver for Rockchip RK3568 SoC
   - Allow disabling SR-IOV VFs through sysfs in qat
   - Fix device reset bugs in hisilicon
   - Fix authenc key parsing by using generic helper in octeontx*

  Others:
   - Fix xor benchmarking on parisc"

* tag 'v6.12-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (96 commits)
  crypto: n2 - Set err to EINVAL if snprintf fails for hmac
  crypto: camm/qi - Use ERR_CAST() to return error-valued pointer
  crypto: mips/crc32 - Clean up useless assignment operations
  crypto: qcom-rng - rename *_of_data to *_match_data
  crypto: qcom-rng - fix support for ACPI-based systems
  dt-bindings: crypto: qcom,prng: document support for SA8255p
  crypto: aegis128 - Fix indentation issue in crypto_aegis128_process_crypt()
  crypto: octeontx* - Select CRYPTO_AUTHENC
  crypto: testmgr - Hide ENOENT errors
  crypto: qat - Remove trailing space after \n newline
  crypto: hisilicon/sec - Remove trailing space after \n newline
  crypto: algboss - Pass instance creation error up
  crypto: api - Fix generic algorithm self-test races
  crypto: hisilicon/qm - inject error before stopping queue
  crypto: hisilicon/hpre - mask cluster timeout error
  crypto: hisilicon/qm - reset device before enabling it
  crypto: hisilicon/trng - modifying the order of header files
  crypto: hisilicon - add a lock for the qp send operation
  crypto: hisilicon - fix missed error branch
  crypto: ccp - do not request interrupt on cmd completion when irqs disabled
  ...
2024-09-16 06:28:28 +02:00
Linus Torvalds
9410645520 Merge tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "The zero-copy changes are relatively significant, but regression risk
  should be contained. The feature needs to be used to cause trouble.

  Also it feels like we got an order of magnitude more semi-automated
  "refactoring" chaff than usual, I wonder if it's just us.

  Core & protocols:

   - Support Device Memory TCP, ability to zero-copy receive TCP
     payloads to a DMABUF region of memory while packet headers land
     separately in normal kernel buffers, and TCP processes then as
     usual.

   - The ability to read the PTP PHC (Physical Hardware Clock) alongside
     MONOTONIC_RAW timestamps with PTP_SYS_OFFSET_EXTENDED. Previously
     only CLOCK_REALTIME was supported.

   - Allow matching on all bits of IP DSCP for routing decisions.
     Previously we only supported on matching TOS bits in IPv4 which is
     a narrower interpretation of the same header field.

   - Increase the range of weights used for multi-path routing from
     8 bits to 16 bits.

   - Add support for IPv6 PIO p flag in the Prefix Information Option
     per draft-ietf-6man-pio-pflag.

   - IPv6 IOAM6 support for new tunsrc encap mode for better
     performance.

   - Detect destinations which blackhole MPTCP traffic and avoid
     initiating MPTCP connections to them for a certain period of time,
     1h by default.

   - Improve IPsec control path performance by removing the inexact
     policies list.

   - AF_VSOCK: add support for SIOCOUTQ ioctl.

   - Add enum for reasons TCP reset was sent for easier tracing.

   - Add SMC ringbufs usage statistics.

  Drivers:

   - Handle netconsole setup failures more gracefully, don't fail
     loading, retain the specified target as disabled.

   - Extend bonding's IPsec offload pass thru capabilities (ESN, stats).

  Filtering:

   - Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
     when long-lived sockets miss a chance to set additional callbacks
     if a sockops program was not attached early in their lifetime.

   - Support using BPF skb helpers in tracepoints.

   - Conntrack Netlink: support CTA_FILTER for flush.

   - Improve SCTP support in nfnetlink_queue.

   - Improve performance of large nftables flush transactions.

  Things we sprinkled into general kernel code:

   - selftests: support setting an "interpreter" for script files; make
     it easy to run as separate cases tests where one "interpreter" is
     fed various test descriptions (in our case packet sequences).

  Driver API:

   - Extend core and ethtool APIs to support many PHYs connected to a
     single interface (PHY topologies).

   - Extend cable diagnostics to specify whether Time Domain
     Reflectometry (TDR) or Active Link Cable Diagnostic (ALCD) was
     used.

   - Add library for implementing MAC-PHY Ethernet drivers for SPI
     devices compatible with Open Alliance 10BASE-T1x MAC-PHY Serial
     Interface (TC6) standard.

   - Add helpers to the PHY framework, for PHYs following the Open
     Alliance standards:
       - 1000BaseT1 link settings
       - cable test and diagnostics

   - Support listing / dumping all allocated RSS contexts.

   - Add configuration for frequency Embedded SYNC in DPLL, which
     magically embeds sync pulses into Ethernet signaling.

  Device drivers:

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - use better FW APIs for queue reset
         - support QOS and TPID settings for the SR-IOV VLAN
         - support dynamic MSI-X allocation
      - Intel (100G, ice, idpf):
         - ice: support PCIe subfunctions
         - iavf: add support for TC U32 filters on VFs
         - ice: support Embedded SYNC in DPLL
      - nVidia/Mellanox (mlx5):
         - support HW managed steering tables
         - support PCIe PTM cross timestamping
      - AMD/Pensando:
         - ionic: use page_pool to increase Rx performance
      - Cisco (enic):
         - report per-queue statistics

   - Ethernet virtual:
      - Microsoft vNIC:
         - mana: support configuring ring length
         - netvsc: enable more channels on systems with many CPUs
      - IBM veth:
         - optimize polling to improve TCP_RR performance
         - optimize performance of Tx handling
      - VirtIO net:
         - synchronize the operstate with the admin state to allow a
           lower virtio-net to propagate the link status to an upper
           device like macvlan

   - Ethernet NICs consumer, and embedded:
      - Add driver for Realtek automotive PCIe devices (RTL9054,
        RTL9068, RTL9072, RTL9075, RTL9068, RTL9071)
      - Add driver for Microchip LAN8650/1 10BASE-T1S MAC-PHY.
      - Microchip:
         - lan743x: use phylink - support WOL, EEE, pause, link settings
         - add Wake-on-LAN support for KSZ87xx family
         - add KSZ8895/KSZ8864 switch support
         - factor out FDMA code and use it in sparx5 and lan966x
           (including DCB support in both)
      - Synopsys (stmmac):
         - support frame preemption (configured using TC and ethtool)
         - support Loongson DWMAC (GMAC v3.73)
         - support RockChips RK3576 DWMAC
      - TI:
         - am65-cpsw: add multi queue RX support
         - icssg-prueth: HSR offload support
      - Cadence (macb):
         - enable software (hrtimer based) IRQ coalescing by default
      - Xilinx (axinet):
         - expose HW statistics
         - improve multicast filtering
         - relax Rx checksum offload constraints
      - MediaTek:
         - mt7530: add EN7581 support
      - Aspeed (ftgmac100):
         - report link speed and duplex
      - Intel:
         - igc: add mqprio offload
         - igc: report EEE configuration
      - RealTek (r8169):
         - add support for RTL8126A rev.b
      - Vitesse (vsc73xx):
         - implement FDB add/del/dump operations
      - Freescale (fs_enet):
         - use phylink

   - Ethernet PHYs:
      - vitesse: implement downshift and MDI-X in vsc73xx PHYs
      - microchip: support LAN887x, supporting IEEE 802.3bw (100BASE-T1)
        and IEEE 802.3bp (1000BASE-T1) specifications
      - add Applied Micro QT2025 PHY driver (in Rust)
      - add Motorcomm yt8821 2.5G Ethernet PHY driver

   - CAN:
      - add driver for Rockchip RK3568 CAN-FD controller
      - flexcan: add wakeup support for imx95
      - kvaser_usb: set hardware timestamp on transmitted packets

   - WiFi:
      - mac80211/cfg80211:
         - EHT rate support in AQL airtime fairness
         - handle DFS (radar detection) per link in Multi-Link Operation
      - RealTek (rtw89):
         - support RTL8852BT and 8852BE-VT (WiFi 6)
         - support hardware rfkill
         - support HW encryption in unicast management frames
         - support Wake-on-WLAN with supported network detection
      - RealTek (rtw89):
         - improve Rx performance by using USB frame aggregation
         - support USB 3 with RTL8822CU/RTL8822BU
      - Intel (iwlwifi/mvm):
         - offload RLC/SMPS functionality to firmware
      - Marvell (mwifiex):
         - add host based MLME to enable WPA3

   - Bluetooth:
      - add support for Amlogic HCI UART protocol
      - add support for ISO data/packets to Intel and NXP drivers"

* tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1303 commits)
  net/mlx5: HWS, check the correct variable in hws_send_ring_alloc_sq()
  netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level()
  ice: Fix a NULL vs IS_ERR() check in probe()
  ice: Fix a couple NULL vs IS_ERR() bugs
  net: ethernet: fs_enet: Make the per clock optional
  net: ti: icssg-prueth: Add multicast filtering support in HSR mode
  net: ti: icssg-prueth: Enable HSR Tx duplication, Tx Tag and Rx Tag offload
  net: ti: icssg-prueth: Add support for HSR frame forward offload
  net: ti: icssg-prueth: Stop hardcoding def_inc
  net: ti: icss-iep: Move icss_iep structure
  net: ibm: emac: get rid of wol_irq
  net: ibm: emac: remove all waiting code
  net: ibm: emac: replace of_get_property
  net: ibm: emac: use netdev's phydev directly
  net: ibm: emac: use devm for register_netdev
  net: ibm: emac: remove mii_bus with devm
  net: ibm: emac: use devm for of_iomap
  net: ibm: emac: manage emac_irq with devm
  net: ibm: emac: use devm for alloc_etherdev
  octeontx2-af: debugfs: Add Channel info to RPM map
  ...
2024-09-16 06:02:27 +02:00
Jakub Kicinski
3b7dc7000e Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-09-11

We've added 12 non-merge commits during the last 16 day(s) which contain
a total of 20 files changed, 228 insertions(+), 30 deletions(-).

There's a minor merge conflict in drivers/net/netkit.c:
  00d066a4d4 ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
  d966087948 ("netkit: Disable netpoll support")

The main changes are:

1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used
   to easily parse skbs in BPF programs attached to tracepoints,
   from Philo Lu.

2) Add a cond_resched() point in BPF's sock_hash_free() as there have
   been several syzbot soft lockup reports recently, from Eric Dumazet.

3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which
   got noticed when zero copy ice driver started to use it,
   from Maciej Fijalkowski.

4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs
   up via netif_receive_skb_list() to better measure latencies,
   from Daniel Xu.

5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.

6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but
   instead gather the actual value via /proc/sys/net/core/max_skb_frags,
   also from Maciej Fijalkowski.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  sock_map: Add a cond_resched() in sock_hash_free()
  selftests/bpf: Expand skb dynptr selftests for tp_btf
  bpf: Allow bpf_dynptr_from_skb() for tp_btf
  tcp: Use skb__nullable in trace_tcp_send_reset
  selftests/bpf: Add test for __nullable suffix in tp_btf
  bpf: Support __nullable argument suffix for tp_btf
  bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
  selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
  xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
  tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
  bpf, sockmap: Correct spelling skmsg.c
  netkit: Disable netpoll support

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================

Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12 20:22:44 -07:00
Linus Torvalds
5da028864f Merge tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "A fix for a NULL worker->pool deref bug which can be triggered when a
  worker is created and then destroyed immediately"

* tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Clear worker->pool in the worker thread context
2024-09-12 13:11:10 -07:00
Christian Brauner
2077006d47 uidgid: make sure we fit into one cacheline
When I expanded uidgid mappings I intended for a struct uid_gid_map to
fit into a single cacheline on x86 as they tend to be pretty
performance sensitive (idmapped mounts etc). But a 4 byte hole was added
that brought it over 64 bytes. Fix that by adding the static extent
array and the extent counter into a substruct. C's type punning for
unions guarantees that we can access ->nr_extents even if the last
written to member wasn't within the same object. This is also what we
rely on in struct_group() and friends. This of course relies on
non-strict aliasing which we don't do.

99) If the member used to read the contents of a union object is not the
    same as the member last used to store a value in the object, the
    appropriate part of the object representation of the value is
    reinterpreted as an object representation in the new type as
    described in 6.2.6 (a process sometimes called "type punning").

Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf
Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:16:09 +02:00
Lai Jiangshan
73613840a8 workqueue: Clear worker->pool in the worker thread context
Marc Hartmayer reported:
        [   23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
        [   23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
        [   23.133954] Fault in home space mode while using kernel ASCE.
        [   23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
        [   23.134207] Oops: 0004 ilc:2 [#1] SMP
	(snip)
        [   23.134516] Call Trace:
        [   23.134520]  [<0000024e326caf28>] worker_thread+0x48/0x430
        [   23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
        [   23.134528]  [<0000024e326d3a3e>] kthread+0x11e/0x130
        [   23.134533]  [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
        [   23.134536]  [<0000024e333fb37a>] ret_from_fork+0xa/0x38
        [   23.134552] Last Breaking-Event-Address:
        [   23.134553]  [<0000024e333f4c04>] mutex_unlock+0x24/0x30
        [   23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops

With debuging and analysis, worker_thread() accesses to the nullified
worker->pool when the newly created worker is destroyed before being
waken-up, in which case worker_thread() can see the result detach_worker()
reseting worker->pool to NULL at the begining.

Move the code "worker->pool = NULL;" out from detach_worker() to fix the
problem.

worker->pool had been designed to be constant for regular workers and
changeable for rescuer. To share attaching/detaching code for regular
and rescuer workers and to avoid worker->pool being accessed inadvertently
when the worker has been detached, worker->pool is reset to NULL when
detached no matter the worker is rescuer or not.

To maintain worker->pool being reset after detached, move the code
"worker->pool = NULL;" in the worker thread context after detached.

It is either be in the regular worker thread context after PF_WQ_WORKER
is cleared or in rescuer worker thread context with wq_pool_attach_mutex
held. So it is safe to do so.

Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: f4b7b53c94 ("workqueue: Detach workers directly in idle_cull_fn()")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11 19:58:20 -10:00
Linus Torvalds
914413e3ee Merge tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:

 - Fix build of serial_core as a module

* tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Export match_devname_and_update_preferred_console()
2024-09-11 11:13:20 -07:00
Baoquan He
b4722b8593 kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansion
Make tags always produces below annoying warnings:

ctags: Warning: kernel/workqueue.c:470: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:474: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:478: null expansion of name pattern "\1"

In commit 25528213fe ("tags: Fix DEFINE_PER_CPU expansions"), codes in
places have been adjusted including cpu_worker_pools definition. I noticed
in commit 4cb1ef6460 ("workqueue: Implement BH workqueues to eventually
replace tasklets"), cpu_worker_pools definition was unfolded back. Not
sure if it was intentionally done or ignored carelessly.

Makes change to mute them specifically.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11 08:10:44 -10:00
Rafael J. Wysocki
0a06811d66 Merge branches 'pm-sleep', 'pm-opp' and 'pm-tools'
Merge updates related to system sleep, operating performance points
(OPP) updates, and PM tooling updates for 6.12-rc1:

 - Remove unused stub for saveable_highmem_page() and remove deprecated
   macros from power management documentation (Andy Shevchenko).

 - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM
   sysfs interface (Xueqin Luo).

 - Update the maintainers information for the operating-points-v2-ti-cpu DT
   binding (Dhruva Gole).

 - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring).

 - Update directory handling and installation process in the pm-graph
   Makefile and add .gitignore to ignore sleepgraph.py artifacts to
   pm-graph (Amit Vadhavana, Yo-Jung Lin).

 - Make cpupower display residency value in idle-info (Aboorva
   Devarajan).

 - Add missing powercap_set_enabled() stub function to cpupower (John
   B. Wyatt IV).

 - Add SWIG support to cpupower (John B. Wyatt IV).

* pm-sleep:
  PM: hibernate: Remove unused stub for saveable_highmem_page()
  Documentation: PM: Discourage use of deprecated macros
  PM: sleep: Use sysfs_emit() and sysfs_emit_at() in "show" functions
  PM: hibernate: Use sysfs_emit() and sysfs_emit_at() in "show" functions

* pm-opp:
  dt-bindings: opp: operating-points-v2-ti-cpu: Update maintainers
  opp: ti: Drop unnecessary of_match_ptr()

* pm-tools:
  pm:cpupower: Add error warning when SWIG is not installed
  MAINTAINERS: Add Maintainers for SWIG Python bindings
  pm:cpupower: Include test_raw_pylibcpupower.py
  pm:cpupower: Add SWIG bindings files for libcpupower
  pm:cpupower: Add missing powercap_set_enabled() stub function
  pm-graph: Update directory handling and installation process in Makefile
  pm-graph: Make git ignore sleepgraph.py artifacts
  tools/cpupower: display residency value in idle-info
2024-09-11 19:02:23 +02:00
Philo Lu
8aeaed21be bpf: Support __nullable argument suffix for tp_btf
Pointers passed to tp_btf were trusted to be valid, but some tracepoints
do take NULL pointer as input, such as trace_tcp_send_reset(). Then the
invalid memory access cannot be detected by verifier.

This patch fix it by add a suffix "__nullable" to the unreliable
argument. The suffix is shown in btf, and PTR_MAYBE_NULL will be added
to nullable arguments. Then users must check the pointer before use it.

A problem here is that we use "btf_trace_##call" to search func_proto.
As it is a typedef, argument names as well as the suffix are not
recorded. To solve this, I use bpf_raw_event_map to find
"__bpf_trace##template" from "btf_trace_##call", and then we can see the
suffix.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240911033719.91468-2-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-09-11 08:56:37 -07:00
Daniel Xu
23dc986732 bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
cpumap takes RX processing out of softirq and onto a separate kthread.
Since the kthread needs to be scheduled in order to run (versus softirq
which does not), we can theoretically experience extra latency if the
system is under load and the scheduler is being unfair to us.

Moving the tracepoint to before passing the skb list up the stack allows
users to more accurately measure enqueue/dequeue latency introduced by
cpumap via xdp:xdp_cpumap_enqueue and xdp:xdp_cpumap_kthread tracepoints.

f9419f7bd7 ("bpf: cpumap add tracepoints") which added the tracepoints
states that the intent behind them was for general observability and for
a feedback loop to see if the queues are being overwhelmed. This change
does not mess with either of those use cases but rather adds a third
one.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/bpf/47615d5b5e302e4bd30220473779e98b492d47cd.1725585718.git.dxu@dxuuu.xyz
2024-09-11 16:32:11 +02:00
Petr Mladek
2c83ded8ae Merge branch 'for-6.11-fixup' into for-linus 2024-09-11 09:30:22 +02:00
Michal Koutný
af000ce852 cgroup: Do not report unavailable v1 controllers in /proc/cgroups
This is a followup to CONFIG-urability of cpuset and memory controllers
for v1 hierarchies. Make the output in /proc/cgroups reflect that
!CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and
!CONFIG_MEMCG_V1 is like !CONFIG_MEMCG.

The intended effect is that hiding the unavailable controllers will hint
users not to try mounting them on v1.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:04:28 -10:00
Michal Koutný
3c41382e92 cgroup: Disallow mounting v1 hierarchies without controller implementation
The configs that disable some v1 controllers would still allow mounting
them but with no controller-specific files. (Making such hierarchies
equivalent to named v1 hierarchies.) To achieve behavior consistent with
actual out-compilation of a whole controller, the mounts should treat
respective controllers as non-existent.

Wrap implementation into a helper function, leverage legacy_files to
detect compiled out controllers. The effect is that mounts on v1 would
fail and produce a message like:
  [ 1543.999081] cgroup: Unknown subsys name 'memory'

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:03:43 -10:00
Michal Koutný
659f90f863 cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
The cpuset filesystem is a legacy interface to cpuset controller with
(pre-)v1 features. It makes little sense to co-mount it on systems
without cpuset v1, so do not build it when cpuset v1 is not built
neither.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:02:12 -10:00
Andy Shevchenko
c3565a35d9 PM: hibernate: Remove unused stub for saveable_highmem_page()
When saveable_highmem_page() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y:

kernel/power/snapshot.c:1369:21: error: unused function 'saveable_highmem_page' [-Werror,-Wunused-function]
 1369 | static inline void *saveable_highmem_page(struct zone *z, unsigned long p)
      |                     ^~~~~~~~~~~~~~~~~~~~~

Fix this by removing unused stub.

See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20240905184848.318978-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-10 20:11:40 +02:00
Linus Torvalds
8d8d276ba2 Merge tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Move declaration of interface_lock outside of CONFIG_TIMERLAT_TRACER

   The fix to some locking races moved the declaration of the
   interface_lock up in the file, but also moved it into the
   CONFIG_TIMERLAT_TRACER #ifdef block, breaking the build when that
   wasn't set. Move it further up and out of that #ifdef block.

 - Remove unused function run_tracer_selftest() stub

   When CONFIG_FTRACE_STARTUP_TEST is not set the stub function
   run_tracer_selftest() is not used and clang is warning about it.
   Remove the function stub as it is not needed.

* tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Drop unused helper function to fix the build
  tracing/osnoise: Fix build when timerlat is not enabled
2024-09-10 09:05:20 -07:00
Benjamin ROBIN
35b603f8a7 ntp: Make sure RTC is synchronized when time goes backwards
sync_hw_clock() is normally called every 11 minutes when time is
synchronized. This issue is that this periodic timer uses the REALTIME
clock, so when time moves backwards (the NTP server jumps into the past),
the timer expires late.

If the timer expires late, which can be days later, the RTC will no longer
be updated, which is an issue if the device is abruptly powered OFF during
this period. When the device will restart (when powered ON), it will have
the date prior to the ADJ_SETOFFSET call.

A normal NTP server should not jump in the past like that, but it is
possible... Another way of reproducing this issue is to use phc2sys to
synchronize the REALTIME clock with, for example, an IRIG timecode with
the source always starting at the same date (not synchronized).

Also, if the time jump in the future by less than 11 minutes, the RTC may
not be updated immediately (minor issue). Consider the following scenario:
 - Time is synchronized, and sync_hw_clock() was just called (the timer
   expires in 11 minutes).
 - A time jump is realized in the future by a couple of minutes.
 - The time is synchronized again.
 - Users may expect that RTC to be updated as soon as possible, and not
   after 11 minutes (for the same reason, if a power loss occurs in this
   period).

Cancel periodic timer on any time jump (ADJ_SETOFFSET) greater than or
equal to 1s. The timer will be relaunched at the end of do_adjtimex() if
NTP is still considered synced. Otherwise the timer will be relaunched
later when NTP is synced. This way, when the time is synchronized again,
the RTC is updated after less than 2 seconds.

Signed-off-by: Benjamin ROBIN <dev@benjarobin.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240908140836.203911-1-dev@benjarobin.fr
2024-09-10 13:50:40 +02:00
Thomas Gleixner
2f7eedca6c Merge branch 'linus' into timers/core
To update with the latest fixes.
2024-09-10 13:49:53 +02:00
Kan Liang
a48a36b316 perf: Add PERF_EV_CAP_READ_SCOPE
Usually, an event can be read from any CPU of the scope. It doesn't need
to be read from the advertised CPU.

Add a new event cap, PERF_EV_CAP_READ_SCOPE. An event of a PMU with
scope can be read from any active CPU in the scope.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-3-kan.liang@linux.intel.com
2024-09-10 11:44:13 +02:00
Kan Liang
4ba4f1afb6 perf: Generic hotplug support for a PMU with a scope
The perf subsystem assumes that the counters of a PMU are per-CPU. So
the user space tool reads a counter from each CPU in the system wide
mode. However, many PMUs don't have a per-CPU counter. The counter is
effective for a scope, e.g., a die or a socket. To address this, a
cpumask is exposed by the kernel driver to restrict to one CPU to stand
for a specific scope. In case the given CPU is removed,
the hotplug support has to be implemented for each such driver.

The codes to support the cpumask and hotplug are very similar.
- Expose a cpumask into sysfs
- Pickup another CPU in the same scope if the given CPU is removed.
- Invoke the perf_pmu_migrate_context() to migrate to a new CPU.
- In event init, always set the CPU in the cpumask to event->cpu

Similar duplicated codes are implemented for each such PMU driver. It
would be good to introduce a generic infrastructure to avoid such
duplication.

5 popular scopes are implemented here, core, die, cluster, pkg, and
the system-wide. The scope can be set when a PMU is registered. If so, a
"cpumask" is automatically exposed for the PMU.

The "cpumask" is from the perf_online_<scope>_mask, which is to track
the active CPU for each scope. They are set when the first CPU of the
scope is online via the generic perf hotplug support. When a
corresponding CPU is removed, the perf_online_<scope>_mask is updated
accordingly and the PMU will be moved to a new CPU from the same scope
if possible.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-2-kan.liang@linux.intel.com
2024-09-10 11:44:12 +02:00
Andy Shevchenko
4e378158e5 tracing: Drop unused helper function to fix the build
A helper function defined but not used. This, in particular,
prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function]
 2229 | static inline int run_tracer_selftest(struct tracer *type)
      |                   ^~~~~~~~~~~~~~~~~~~

Fix this by dropping unused functions.

See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/20240909105314.928302-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09 16:04:25 -04:00
Steven Rostedt
af17814334 tracing/osnoise: Fix build when timerlat is not enabled
To fix some critical section races, the interface_lock was added to a few
locations. One of those locations was above where the interface_lock was
declared, so the declaration was moved up before that usage.
Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER
ifdef block. As the interface_lock is used outside that config, this broke
the build when CONFIG_OSNOISE_TRACER was enabled but
CONFIG_TIMERLAT_TRACER was not.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Helena Anna" <helena.anna.dubel@intel.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home
Fixes: e6a53481da ("tracing/timerlat: Only clear timer if a kthread exists")
Reported-by: "Bityutskiy, Artem" <artem.bityutskiy@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09 16:03:57 -04:00
Yu Liao
3e5b2e81f1 printk: Export match_devname_and_update_preferred_console()
When building serial_base as a module, modpost fails with the following
error message:

  ERROR: modpost: "match_devname_and_update_preferred_console"
  [drivers/tty/serial/serial_base.ko] undefined!

Export the symbol to allow using it from modules.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409071312.qlwtTOS1-lkp@intel.com/
Fixes: 12c91cec31 ("serial: core: Add serial_base_match_and_update_preferred_console()")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Link: https://lore.kernel.org/r/20240909075652.747370-1-liaoyu15@huawei.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-09 17:35:06 +02:00
Anna-Maria Behnsen
bd7c8ff9fe treewide: Fix wrong singular form of jiffies in comments
There are several comments all over the place, which uses a wrong singular
form of jiffies.

Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
662a1bfb90 cpu: Use already existing usleep_range()
usleep_range() is a wrapper arount usleep_range_state() which hands in
TASK_UNTINTERRUPTIBLE as state argument.

Use already exising wrapper usleep_range(). No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-2-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
fe90c5ba88 timers: Rename next_expiry_recalc() to be unique
next_expiry_recalc is the name of a function as well as the name of a
struct member of struct timer_base. This might lead to confusion.

Rename next_expiry_recalc() to timer_recalc_next_expiry(). No functional
change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-1-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Neeraj Upadhyay
355debb83b Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a 2024-09-09 00:09:47 +05:30
Paul E. McKenney
1ecd9d68eb rcu: Defer printing stall-warning backtrace when holding rcu_node lock
The rcu_dump_cpu_stacks() holds the leaf rcu_node structure's ->lock
when dumping the stakcks of any CPUs stalling the current grace period.
This lock is held to prevent confusion that would otherwise occur when
the stalled CPU reported its quiescent state (and then went on to do
unrelated things) just as the backtrace NMI was heading towards it.

This has worked well, but on larger systems has recently been observed
to cause severe lock contention resulting in CSD-lock stalls and other
general unhappiness.

This commit therefore does printk_deferred_enter() before acquiring
the lock and printk_deferred_exit() after releasing it, thus deferring
the overhead of actually outputting the stack trace out of that lock's
critical section.

Reported-by: Rik van Riel <riel@surriel.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:06:44 +05:30
Frederic Weisbecker
7562eed272 rcu/nocb: Remove superfluous memory barrier after bypass enqueue
Pre-GP accesses performed by the update side must be ordered against
post-GP accesses performed by the readers. This is ensured by the
bypass or nocb locking on enqueue time, followed by the fully ordered
rnp locking initiated while callbacks are accelerated, and then
propagated throughout the whole GP lifecyle associated with the
callbacks.

Therefore the explicit barrier advertizing ordering between bypass
enqueue and rcuo wakeup is superfluous. If anything, it would even only
order the first bypass callback enqueue against the rcuo wakeup and
ignore all the subsequent ones.

Remove the needless barrier.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
1b022b8763 rcu/nocb: Conditionally wake up rcuo if not already waiting on GP
A callback enqueuer currently wakes up the rcuo kthread if it is adding
the first non-done callback of a CPU, whether the kthread is waiting on
a grace period or not (unless the CPU is offline).

This looks like a desired behaviour because then the rcuo kthread
doesn't wait for the end of the current grace period to handle the
callback. It is accelerated right away and assigned to the next grace
period. The GP kthread is notified about that fact and iterates with
the upcoming GP without sleeping in-between.

However this best-case scenario is contradicted by a few details,
depending on the situation:

1) If the callback is a non-bypass one queued with IRQs enabled, the
   wake up only occurs if no other pending callbacks are on the list.
   Therefore the theoretical "optimization" actually applies on rare
   occasions.

2) If the callback is a non-bypass one queued with IRQs disabled, the
   situation is similar with even more uncertainty due to the deferred
   wake up.

3) If the callback is lazy, a few jiffies don't make any difference.

4) If the callback is bypass, the wake up timer is programmed 2 jiffies
   ahead by rcuo in case the regular pending queue has been handled
   in the meantime. The rare storm of callbacks can otherwise wait for
   the currently elapsing grace period to be flushed and handled.

For all those reasons, the optimization is only theoretical and
occasional. Therefore it is reasonable that callbacks enqueuers only
wake up the rcuo kthread when it is not already waiting on a grace
period to complete.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
9139f93209 rcu/nocb: Fix RT throttling hrtimer armed from offline CPU
After a CPU is marked offline and until it reaches its final trip to
idle, rcuo has several opportunities to be woken up, either because
a callback has been queued in the meantime or because
rcutree_report_cpu_dead() has issued the final deferred NOCB wake up.

If RCU-boosting is enabled, RCU kthreads are set to SCHED_FIFO policy.
And if RT-bandwidth is enabled, the related hrtimer might be armed.
However this then happens after hrtimers have been migrated at the
CPUHP_AP_HRTIMERS_DYING stage, which is broken as reported by the
following warning:

 Call trace:
  enqueue_hrtimer+0x7c/0xf8
  hrtimer_start_range_ns+0x2b8/0x300
  enqueue_task_rt+0x298/0x3f0
  enqueue_task+0x94/0x188
  ttwu_do_activate+0xb4/0x27c
  try_to_wake_up+0x2d8/0x79c
  wake_up_process+0x18/0x28
  __wake_nocb_gp+0x80/0x1a0
  do_nocb_deferred_wakeup_common+0x3c/0xcc
  rcu_report_dead+0x68/0x1ac
  cpuhp_report_idle_dead+0x48/0x9c
  do_idle+0x288/0x294
  cpu_startup_entry+0x34/0x3c
  secondary_start_kernel+0x138/0x158

Fix this with waking up rcuo using an IPI if necessary. Since the
existing API to deal with this situation only handles swait queue, rcuo
is only woken up from offline CPUs if it's not already waiting on a
grace period. In the worst case some callbacks will just wait for a
grace period to complete before being assigned to a subsequent one.

Reported-by: "Cheng-Jui Wang (王正睿)" <Cheng-Jui.Wang@mediatek.com>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30