Commit Graph

48390 Commits

Author SHA1 Message Date
Thomas Weißschuh
76164ca0d1 vdso/vsyscall: Split up __arch_update_vsyscall() into __arch_update_vdso_clock()
The upcoming auxiliary clocks need this hook, too.
To separate the architecture hooks from the timekeeper internals, refactor
the hook to only operate on a single vDSO clock.

While at it, use a more robust #define for the hook override.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-3-df7d9f87b9b8@linutronix.de
2025-07-09 11:52:34 +02:00
Thomas Weißschuh
6fedaf682a vdso/vsyscall: Introduce a helper to fill clock configurations
The logic to configure a 'struct vdso_clock' from a
'struct tk_read_base' is copied two times.
Split it into a shared function to reduce the duplication,
especially as another user will be added for auxiliary clocks.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-2-df7d9f87b9b8@linutronix.de
2025-07-09 11:52:34 +02:00
Thomas Gleixner
068f7b64bf Merge v6.16-rc2 into timers/ptp
to pick up the __GENMASK() fix, otherwise the AUX clock VDSO patches fail
to compile for compat.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-07-09 11:51:34 +02:00
Thomas Gleixner
a6d9638d4d Merge tag 'ktime-get-clock-ts64-for-ptp' into timers/ptp
Pull the base implementation of ktime_get_clock_ts64() for PTP, which
contains a temporary CLOCK_AUX* workaround. That was created to allow
integration of depending changes into the networking tree. The workaround
is going to be removed in a subsequent change in the timekeeping tree.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-07-03 14:35:53 +02:00
Thomas Gleixner
5b605dbee0 timekeeping: Provide ktime_get_clock_ts64()
PTP implements an inline switch case for taking timestamps from various
POSIX clock IDs, which already consumes quite some text space. Expanding it
for auxiliary clocks really becomes too big for inlining.

Provide a out of line version. 

The function invalidates the timestamp in case the clock is invalid. The
invalidation allows to implement a validation check without the need to
propagate a return value through deep existing call chains.

Due to merge logistics this temporarily defines CLOCK_AUX[_LAST] if
undefined, so that the plain branch, which does not contain any of the core
timekeeper changes, can be pulled into the networking tree as prerequisite
for the PTP side changes. These temporary defines are removed after that
branch is merged into the tip::timers/ptp branch. That way the result in
-next or upstream in the next merge window has zero dependencies.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250701132628.357686408@linutronix.de
2025-07-03 14:18:06 +02:00
Thomas Gleixner
7b95663a3d timekeeping: Provide interface to control auxiliary clocks
Auxiliary clocks are disabled by default and attempts to access them
fail.

Provide an interface to enable/disable them at run-time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.444626478@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
e6d4c00719 timekeeping: Provide update for auxiliary timekeepers
Update the auxiliary timekeepers periodically. For now this is tied to the system
timekeeper update from the tick. This might be revisited and moved out of the tick.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.382451331@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
ecf3e70304 timekeeping: Provide adjtimex() for auxiliary clocks
The behaviour is close to clock_adtime(CLOCK_REALTIME) with the
following differences:

  1) ADJ_SETOFFSET adjusts the auxiliary clock offset
  
  2) ADJ_TAI is not supported

  3) Leap seconds are not supported

  4) PPS is not supported

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.317946543@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
4eca49d0b6 timekeeping: Prepare do_adtimex() for auxiliary clocks
Exclude ADJ_TAI, leap seconds and PPS functionality as they make no sense
in the context of auxiliary clocks and provide a time stamp based on the
actual clock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.253203783@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
775f71ebed timekeeping: Make do_adjtimex() reusable
Split out the actual functionality of adjtimex() and make do_adjtimex() a
wrapper which feeds the core timekeeper into it and handles the result
including audit at the call site.

This allows to reuse the actual functionality for auxiliary clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.187322876@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
2c8aea59c2 timekeeping: Add auxiliary clock support to __timekeeping_inject_offset()
Redirect the relative offset adjustment to the auxiliary clock offset
instead of modifying CLOCK_REALTIME, which has no meaning in context of
these clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.124057787@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
e8db3a5579 timekeeping: Make timekeeping_inject_offset() reusable
Split out the inner workings for auxiliary clock support and feed the core time
keeper into it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.059934561@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
60ecc26ec5 timekeeping: Provide time setter for auxiliary clocks
Add clock_settime(2) support for auxiliary clocks. The function affects the
AUX offset which is added to the "monotonic" clock readout of these clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.995688714@linutronix.de
2025-06-27 20:13:12 +02:00
Thomas Gleixner
606424bf4f timekeeping: Add minimal posix-timers support for auxiliary clocks
Provide clock_getres(2) and clock_gettime(2) for auxiliary clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.932220594@linutronix.de
2025-06-27 20:13:12 +02:00
Thomas Gleixner
05bc6e6290 timekeeping: Provide time getters for auxiliary clocks
Provide interfaces similar to the ktime_get*() family which provide access
to the auxiliary clocks.

These interfaces have a boolean return value, which indicates whether the
accessed clock is valid or not.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.868342628@linutronix.de
2025-06-27 20:13:12 +02:00
Thomas Gleixner
9f7729480a timekeeping: Update auxiliary timekeepers on clocksource change
Propagate a system clocksource change to the auxiliary timekeepers so that
they can pick up the new clocksource.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.803890875@linutronix.de
2025-06-27 20:13:12 +02:00
Thomas Gleixner
c85f5ab608 ntp: Use ktime_get_ntp_seconds()
Use ktime_get_ntp_seconds() to prepare for auxiliary clocks so that
the readout becomes per timekeeper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250519083026.472512636@linutronix.de
2025-06-19 14:28:24 +02:00
Thomas Gleixner
ffa0519baa timekeeping: Provide ktime_get_ntp_seconds()
ntp_adjtimex() requires access to the actual time keeper per timekeeper
ID. Provide an interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250519083026.411809421@linutronix.de
2025-06-19 14:28:23 +02:00
Anna-Maria Behnsen
22c62b9a84 timekeeping: Introduce auxiliary timekeepers
Provide timekeepers for auxiliary clocks and initialize them during
boot.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083026.350061049@linutronix.de
2025-06-19 14:28:23 +02:00
Thomas Gleixner
6168024604 timekeeping: Add clock_valid flag to timekeeper
In preparation for supporting independent auxiliary timekeepers, add a
clock valid field and set it to true for the system timekeeper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083026.287145536@linutronix.de
2025-06-19 14:28:23 +02:00
Thomas Gleixner
8c782acd3f timekeeping: Prepare timekeeping_update_from_shadow()
Don't invoke the VDSO and paravirt updates when utilized for auxiliary
clocks. This is a temporary workaround until the VDSO and paravirt
interfaces have been worked out.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250519083026.223876435@linutronix.de
2025-06-19 14:28:23 +02:00
Anna-Maria Behnsen
926ad47516 timekeeping: Make __timekeeping_advance() reusable
In __timekeeping_advance() the pointer to struct tk_data is hardcoded by the
use of &tk_core. As long as there is only a single timekeeper (tk_core),
this is not a problem. But when __timekeeping_advance() will be reused for
per auxiliary timekeepers, __timekeeping_advance() needs to be generalized.

Add a pointer to struct tk_data as function argument of
__timekeeping_advance() and adapt all call sites.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083026.160967312@linutronix.de
2025-06-19 14:28:23 +02:00
Thomas Gleixner
c7ebfbc440 ntp: Rename __do_adjtimex() to ntp_adjtimex()
Clean up the name space. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083026.095637820@linutronix.de
2025-06-19 14:28:23 +02:00
Thomas Gleixner
5ffa25f573 ntp: Add timekeeper ID arguments to public functions
In preparation for supporting auxiliary POSIX clocks, add a timekeeper ID
to the relevant functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083026.032425931@linutronix.de
2025-06-19 14:28:23 +02:00
Thomas Gleixner
8515714b0f ntp: Add support for auxiliary timekeepers
If auxiliary clocks are enabled, provide an array of NTP data so that the
auxiliary timekeepers can be steered independently of the core timekeeper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083025.969000914@linutronix.de
2025-06-19 14:28:22 +02:00
Anna-Maria Behnsen
9094c72c3d time: Introduce auxiliary POSIX clocks
To support auxiliary timekeeping and the related user space interfaces,
it's required to define a clock ID range for them.

Reserve 8 auxiliary clock IDs after the regular timekeeping clock ID space.

This is the maximum number of auxiliary clocks the kernel can support. The actual
number of supported clocks depends obviously on the presence of related devices
and might be constraint by the available VDSO space.

Add the corresponding timekeeper IDs as well.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083025.905800695@linutronix.de
2025-06-19 14:28:22 +02:00
Anna-Maria Behnsen
f12b45862c timekeeping: Introduce timekeeper ID
As long as there is only a single timekeeper, there is no need to clarify
which timekeeper is used. But with the upcoming reusage of the timekeeper
infrastructure for auxiliary clock timekeepers, an ID is required to
differentiate.

Introduce an enum for timekeeper IDs, introduce a field in struct tk_data
to store this timekeeper id and add also initialization. The id struct
field is added at the end of the second cachline, as there is a 4 byte hole
anyway.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083025.842476378@linutronix.de
2025-06-19 14:28:22 +02:00
Thomas Gleixner
7e55b6ba1f timekeeping: Avoid double notification in do_adjtimex()
Consolidate do_adjtimex() so that it does not notify about clock changes
twice.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250519083025.779267274@linutronix.de
2025-06-19 14:28:22 +02:00
Thomas Gleixner
506a54a031 timekeeping: Cleanup kernel doc of __ktime_get_real_seconds()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083025.715836017@linutronix.de
2025-06-19 14:28:22 +02:00
Thomas Gleixner
990518eb3a timekeeping: Remove hardcoded access to tk_core
This was overlooked in the initial conversion. Use the provided pointer to
access the shadow timekeeper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250519083025.652611452@linutronix.de
2025-06-19 14:28:22 +02:00
Oleg Nesterov
f90fff1e15 posix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del()
If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().

If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.

Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.

This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail
anyway in this case.

Cc: stable@vger.kernel.org
Reported-by: Benoît Sevens <bsevens@google.com>
Fixes: 0bdd2ed413 ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-06-13 10:55:49 -07:00
Steven Rostedt
8a157d8a00 tracing: Do not free "head" on error path of filter_free_subsystem_filters()
The variable "head" is allocated and initialized as a list before
allocating the first "item" for the list. If the allocation of "item"
fails, it frees "head" and then jumps to the label "free_now" which will
process head and free it.

This will cause a UAF of "head", and it doesn't need to free it before
jumping to the "free_now" label as that code will free it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250610093348.33c5643a@gandalf.local.home
Fixes: a9d0aab5eb ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202506070424.lCiNreTI-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-10 09:39:58 -04:00
Linus Torvalds
be54f8c558 Merge tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanup from Thomas Gleixner:
 "The delayed from_timer() API cleanup:

  The renaming to the timer_*() namespace was delayed due massive
  conflicts against Linux-next. Now that everything is upstream finish
  the conversion"

* tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide, timers: Rename from_timer() to timer_container_of()
2025-06-08 11:33:00 -07:00
Linus Torvalds
538c429a4b Merge tag 'trace-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull more tracing fixes from Steven Rostedt:

 - Fix regression of waiting a long time on updating trace event filters

   When the faultable trace points were added, it needed task trace RCU
   synchronization.

   This was added to the tracepoint_synchronize_unregister() function.
   The filter logic always called this function whenever it updated the
   trace event filters before freeing the old filters. This increased
   the time of "trace-cmd record" from taking 13 seconds to running over
   2 minutes to complete.

   Move the freeing of the filters to call_rcu*() logic, which brings
   the time back down to 13 seconds.

 - Fix ring_buffer_subbuf_order_set() error path lock protection

   The error path of the ring_buffer_subbuf_order_set() released the
   mutex too early and allowed subsequent accesses to setting the
   subbuffer size to corrupt the data and cause a bug.

   By moving the mutex locking to the end of the error path, it prevents
   the reentrant access to the critical data and also allows the
   function to convert the taking of the mutex over to the guard()
   logic.

 - Remove unused power management clock events

   The clock events were added in 2010 for power management. In 2011 arm
   used them. In 2013 the code they were used in was removed. These
   events have been wasting memory since then.

 - Fix sparse warnings

   There was a few places that sparse warned about trace_events_filter.c
   where file->filter was referenced directly, but it is annotated with
   an __rcu tag. Use the helper functions and fix them up to use
   rcu_dereference() properly.

* tag 'trace-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Add rcu annotation around file->filter accesses
  tracing: PM: Remove unused clock events
  ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set()
  tracing: Fix regression of filter waiting a long time on RCU synchronization
2025-06-08 08:19:01 -07:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
8630c59e99 Merge tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:

 - Add support for the EXPORT_SYMBOL_GPL_FOR_MODULES() macro, which
   exports a symbol only to specified modules

 - Improve ABI handling in gendwarfksyms

 - Forcibly link lib-y objects to vmlinux even if CONFIG_MODULES=n

 - Add checkers for redundant or missing <linux/export.h> inclusion

 - Deprecate the extra-y syntax

 - Fix a genksyms bug when including enum constants from *.symref files

* tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (28 commits)
  genksyms: Fix enum consts from a reference affecting new values
  arch: use always-$(KBUILD_BUILTIN) for vmlinux.lds
  kbuild: set y instead of 1 to KBUILD_{BUILTIN,MODULES}
  efi/libstub: use 'targets' instead of extra-y in Makefile
  module: make __mod_device_table__* symbols static
  scripts/misc-check: check unnecessary #include <linux/export.h> when W=1
  scripts/misc-check: check missing #include <linux/export.h> when W=1
  scripts/misc-check: add double-quotes to satisfy shellcheck
  kbuild: move W=1 check for scripts/misc-check to top-level Makefile
  scripts/tags.sh: allow to use alternative ctags implementation
  kconfig: introduce menu type enum
  docs: symbol-namespaces: fix reST warning with literal block
  kbuild: link lib-y objects to vmlinux forcibly even when CONFIG_MODULES=n
  tinyconfig: enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
  docs/core-api/symbol-namespaces: drop table of contents and section numbering
  modpost: check forbidden MODULE_IMPORT_NS("module:") at compile time
  kbuild: move kbuild syntax processing to scripts/Makefile.build
  Makefile: remove dependency on archscripts for header installation
  Documentation/kbuild: Add new gendwarfksyms kABI rules
  Documentation/kbuild: Drop section numbers
  ...
2025-06-07 10:05:35 -07:00
Steven Rostedt
549e914c96 tracing: Add rcu annotation around file->filter accesses
Running sparse on trace_events_filter.c triggered several warnings about
file->filter being accessed directly even though it's annotated with __rcu.

Add rcu_dereference() around it and shuffle the logic slightly so that
it's always referenced via accessor functions.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250607102821.6c7effbf@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-07 10:31:22 -04:00
Linus Torvalds
d3c82f618a Merge tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "13 hotfixes.

  6 are cc:stable and the remainder address post-6.15 issues or aren't
  considered necessary for -stable kernels. 11 are for MM"

* tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  MAINTAINERS: add mm swap section
  kmsan: test: add module description
  MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATION
  mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  mm/hugetlb: unshare page tables during VMA split, not before
  MAINTAINERS: add Alistair as reviewer of mm memory policy
  iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvec
  mm/mempolicy: fix incorrect freeing of wi_kobj
  alloc_tag: handle module codetag load errors as module load failures
  mm/madvise: handle madvise_lock() failure during race unwinding
  mm: fix vmstat after removing NR_BOUNCE
  KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY
2025-06-06 21:45:45 -07:00
Linus Torvalds
119b1e61a7 Merge tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:

 - Support for the FWFT SBI extension, which is part of SBI 3.0 and a
   dependency for many new SBI and ISA extensions

 - Support for getrandom() in the VDSO

 - Support for mseal

 - Optimized routines for raid6 syndrome and recovery calculations

 - kexec_file() supports loading Image-formatted kernel binaries

 - Improvements to the instruction patching framework to allow for
   atomic instruction patching, along with rules as to how systems need
   to behave in order to function correctly

 - Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha,
   some SiFive vendor extensions

 - Various fixes and cleanups, including: misaligned access handling,
   perf symbol mangling, module loading, PUD THPs, and improved uaccess
   routines

* tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (69 commits)
  riscv: uaccess: Only restore the CSR_STATUS SUM bit
  RISC-V: vDSO: Wire up getrandom() vDSO implementation
  riscv: enable mseal sysmap for RV64
  raid6: Add RISC-V SIMD syndrome and recovery calculations
  riscv: mm: Add support for Svinval extension
  RISC-V: Documentation: Add enough title underlines to CMODX
  riscv: Improve Kconfig help for RISCV_ISA_V_PREEMPTIVE
  MAINTAINERS: Update Atish's email address
  riscv: uaccess: do not do misaligned accesses in get/put_user()
  riscv: process: use unsigned int instead of unsigned long for put_user()
  riscv: make unsafe user copy routines use existing assembly routines
  riscv: hwprobe: export Zabha extension
  riscv: Make regs_irqs_disabled() more clear
  perf symbols: Ignore mapping symbols on riscv
  RISC-V: Kconfig: Fix help text of CMDLINE_EXTEND
  riscv: module: Optimize PLT/GOT entry counting
  riscv: Add support for PUD THP
  riscv: xchg: Prefetch the destination word for sc.w
  riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop
  riscv: Add support for Zicbop
  ...
2025-06-06 18:05:18 -07:00
Dmitry Antipov
40ee2afafc ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set()
Enlarge the critical section in ring_buffer_subbuf_order_set() to
ensure that error handling takes place with per-buffer mutex held,
thus preventing list corruption and other concurrency-related issues.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Link: https://lore.kernel.org/20250606112242.1510605-1-dmantipov@yandex.ru
Reported-by: syzbot+05d673e83ec640f0ced9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=05d673e83ec640f0ced9
Fixes: f9b94daa54 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06 20:25:55 -04:00
Steven Rostedt
a9d0aab5eb tracing: Fix regression of filter waiting a long time on RCU synchronization
When faultable trace events were added, a trace event may no longer use
normal RCU to synchronize but instead used synchronize_rcu_tasks_trace().
This synchronization takes a much longer time to synchronize.

The filter logic would free the filters by calling
tracepoint_synchronize_unregister() after it unhooked the filter strings
and before freeing them. With this function now calling
synchronize_rcu_tasks_trace() this increased the time to free a filter
tremendously. On a PREEMPT_RT system, it was even more noticeable.

 # time trace-cmd record -p function sleep 1
 [..]
 real	2m29.052s
 user	0m0.244s
 sys	0m20.136s

As trace-cmd would clear out all the filters before recording, it could
take up to 2 minutes to do a recording of "sleep 1".

To find out where the issues was:

 ~# trace-cmd sqlhist -e -n sched_stack  select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta,  start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid

Which will produce the following commands (and -e will also execute them):

 echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events
 echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
 echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger

The above creates a synthetic event that creates a stack trace when a task
schedules out and records it with the time it scheduled back in. Basically
the time a task is off the CPU. It also records the state of the task when
it left the CPU (running, blocked, sleeping, etc). It also saves the comm
of the task as "comm" (needed for the next command).

~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" &&  state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger

The above creates a histogram with buckets per state, per stack, and the
value of the total time it was off the CPU for that stack trace. It filters
on tasks with "comm == trace-cmd" and only the sleeping and blocked states
(1 - sleeping, 2 - blocked).

~# trace-cmd record -p function sleep 1

~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18
{ state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_tasks_generic+0x151/0x230
         apply_subsystem_event_filter+0xa2b/0x1300
         subsystem_filter_write+0x67/0xc0
         vfs_write+0x1e2/0xeb0
         ksys_write+0xff/0x1d0
         do_syscall_64+0x7b/0x420
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
} hitcount:        237  delta:   99756288  <<--------------- Delta is 99 seconds!

Totals:
    Hits: 525
    Entries: 21
    Dropped: 0

This shows that this particular trace waited for 99 seconds on
synchronize_rcu_tasks() in apply_subsystem_event_filter().

In fact, there's a lot of places in the filter code that spends a lot of
time waiting for synchronize_rcu_tasks_trace() in order to free the
filters.

Add helper functions that will use call_rcu*() variants to asynchronously
free the filters. This brings the timings back to normal:

 # time trace-cmd record -p function sleep 1
 [..]
 real	0m14.681s
 user	0m0.335s
 sys	0m28.616s

And the histogram also shows this:

~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21
{ state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_normal+0x3db/0x5c0
         tracing_reset_online_cpus+0x8f/0x1e0
         tracing_open+0x335/0x440
         do_dentry_open+0x4c6/0x17a0
         vfs_open+0x82/0x360
         path_openat+0x1a36/0x2990
         do_filp_open+0x1c5/0x420
         do_sys_openat2+0xed/0x180
         __x64_sys_openat+0x108/0x1d0
         do_syscall_64+0x7b/0x420
} hitcount:          2  delta:      77044

Totals:
    Hits: 55
    Entries: 28
    Dropped: 0

Where the total waiting time of synchronize_rcu_tasks_trace() is 77
milliseconds.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Andreas Ziegler <ziegler.andreas@siemens.com>
Cc: Felix MOESSBAUER <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/20250606201936.1e3d09a9@batman.local.home
Reported-by: "Flot, Julien" <julien.flot@siemens.com>
Tested-by: Julien Flot <julien.flot@siemens.com>
Fixes: a363d27cdb ("tracing: Allow system call tracepoints to handle page faults")
Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06 20:25:55 -04:00
Max Kellermann
2da20fd904 kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
Expose a simple counter to userspace for monitoring tools.

(akpm: 2536c5c7d6 added the documentation but the code changes were lost)

Link: https://lkml.kernel.org/r/20250504180831.4190860-3-max.kellermann@ionos.com
Fixes: 2536c5c7d6 ("kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Cc: Core Minyard <cminyard@mvista.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Song Liu <song@kernel.org>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:25 -07:00
Suren Baghdasaryan
044d2aee6c alloc_tag: handle module codetag load errors as module load failures
Failures inside codetag_load_module() are currently ignored.  As a result
an error there would not cause a module load failure and freeing of the
associated resources.  Correct this behavior by propagating the error code
to the caller and handling possible errors.  With this change, error to
allocate percpu counters, which happens at this stage, will not be ignored
and will cause a module load failure and freeing of resources.  With this
change we also do not need to disable memory allocation profiling when
this error happens, instead we fail to load the module.

Link: https://lkml.kernel.org/r/20250521160602.1940771-1-surenb@google.com
Fixes: 10075262888b ("alloc_tag: allocate percpu counters for module tags dynamically")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Casey Chen <cachen@purestorage.com>
Closes: https://lore.kernel.org/all/20250520231620.15259-1-cachen@purestorage.com/
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05 22:02:23 -07:00
Palmer Dabbelt
2670a39b1e Merge tag 'riscv-mw2-6.16-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into for-next
riscv patches for 6.16-rc1, part 2

* Performance improvements
  - Add support for vdso getrandom
  - Implement raid6 calculations using vectors
  - Introduce svinval tlb invalidation

* Cleanup
  - A bunch of deduplication of the macros we use for manipulating instructions

* Misc
  - Introduce a kunit test for kprobes
  - Add support for mseal as riscv fits the requirements (thanks to Lorenzo for making sure of that :))

[Palmer: There was a rebase between part 1 and part 2, so I've had to do
some more git surgery here... at least two rounds of surgery...]

* alex-pr-2: (866 commits)
  RISC-V: vDSO: Wire up getrandom() vDSO implementation
  riscv: enable mseal sysmap for RV64
  raid6: Add RISC-V SIMD syndrome and recovery calculations
  riscv: mm: Add support for Svinval extension
  riscv: Add kprobes KUnit test
  riscv: kprobes: Remove duplication of RV_EXTRACT_ITYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_UTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_RD_REG
  riscv: kprobes: Remove duplication of RVC_EXTRACT_BTYPE_IMM
  riscv: kprobes: Remove duplication of RVC_EXTRACT_C2_RS1_REG
  riscv: kproves: Remove duplication of RVC_EXTRACT_JTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_BTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_RS1_REG
  riscv: kprobes: Remove duplication of RV_EXTRACT_JTYPE_IMM
  riscv: kprobes: Move branch_funct3 to insn.h
  riscv: kprobes: Move branch_rs2_idx to insn.h
  Linux 6.15-rc6
  Input: xpad - fix xpad_device sorting
  Input: xpad - add support for several more controllers
  Input: xpad - fix Share button on Xbox One controllers
  ...
2025-06-05 14:03:16 -07:00
Masahiro Yamada
81a93bf93f tinyconfig: enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
This CONFIG option, if supported by the architecture, helps reduce the
size of vmlinux.

For example, the size of vmlinux with ARCH=arm tinyconfig decreases as
follows:

   text	   data	    bss	    dec	    hex	filename
 631684	 104500	  18176	 754360	  b82b8	vmlinux.before
 455316	  93404	  15472	 564192	  89be0	vmlinux.after

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-06-06 05:40:24 +09:00
Andy Chiu
500e626c4a kernel: ftrace: export ftrace_sync_ipi
The following ftrace patch for riscv uses a data store to update ftrace
function. Therefore, a romote fence is required to order it against
function_trace_op updates. The mechanism is similar to the fence between
function_trace_op and update_ftrace_func in the generic ftrace, so we
leverage the same ftrace_sync_ipi function.

[ alex: Fix build warning when !CONFIG_DYNAMIC_FTRACE ]

Signed-off-by: Andy Chiu <andybnac@gmail.com>
Link: https://lore.kernel.org/r/20250407180838.42877-4-andybnac@gmail.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:09:24 -07:00
Linus Torvalds
16b70698aa Merge tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
 "Two fixes in the built-in idle selection helpers:

   - Fix prev_cpu handling to guarantee that idle selection never
     returns a CPU that's not allowed

   - Skip cross-node search when !NUMA which could lead to infinite
     looping due to a bug in NUMA iterator"

* tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Skip cross-node search with !CONFIG_NUMA
  sched_ext: idle: Properly handle invalid prev_cpu during idle selection
2025-06-04 12:07:16 -07:00
Linus Torvalds
70087d2200 Merge tag 'trace-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Fix UAF in module unload in ftrace when there's a bug in the module

   If a module is buggy and triggers ftrace_disable which is set when an
   anomaly is detected, when it gets unloaded it doesn't free the hooks
   into kallsyms, and when a kallsyms lookup is performed it may access
   the mod->modname field and crash via UAF.

   Fix this by still freeing the mod_maps that are attached to kallsyms
   on module unload regardless if ftrace_disable is set or not.

 - Do not bother allocating mod_maps for kallsyms if ftrace_disable is
   set

 - Remove unused trace events

   When a trace event or tracepoint is created but not used, it still
   creates the code and data structures needed for that trace event.
   This just wastes memory.

   Remove the trace events that are created but not used. This does not
   remove trace events that are created but are not used due configs not
   being set. That will be handled later. This only removes events that
   have no user under any config.

* tag 'trace-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fsdax: Remove unused trace events for dax insert mapping
  genirq/matrix: Remove unused irq_matrix_alloc_reserved tracepoint
  xdp: Remove unused mem_return_failed event
  ftrace: Don't allocate ftrace module map if ftrace is disabled
  ftrace: Fix UAF when lookup kallsym after ftrace disabled
2025-06-03 14:21:31 -07:00
Linus Torvalds
def5b099a6 Merge tag 'cgroup-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "The rstat per-subsystem split change skipped per-cpu allocation on UP
  configs; however even on UP, depending on config options, the size of
  the percpu struct may not be zero leading to crashes.

  Fix it by conditionalizing the per-cpu area allocation and usage on
  the size of the per-cpu struct"

* tag 'cgroup-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: adjust criteria for rstat subsystem cpu lock access
2025-06-03 14:12:36 -07:00
Andrea Righi
9960be72a5 sched_ext: idle: Skip cross-node search with !CONFIG_NUMA
In the idle CPU selection logic, attempting cross-node searches adds
unnecessary complexity when CONFIG_NUMA is disabled.

Since there's no meaningful concept of nodes in this case, simplify the
logic by restricting the idle CPU search to the current node only.

Fixes: 48849271e6 ("sched_ext: idle: Per-node idle cpumasks")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-03 08:22:27 -10:00