Commit Graph

49605 Commits

Author SHA1 Message Date
Ihor Solodrai
5fc5d8fded bpf: Add bpf_dynptr_memset() kfunc
Currently there is no straightforward way to fill dynptr memory with a
value (most commonly zero). One can do it with bpf_dynptr_write(), but
a temporary buffer is necessary for that.

Implement bpf_dynptr_memset() - an analogue of memset() from libc.

Signed-off-by: Ihor Solodrai <isolodrai@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250702210309.3115903-2-isolodrai@meta.com
2025-07-03 15:21:20 -07:00
tuhaowen
4266e8fa56 PM: sleep: console: Fix the black screen issue
When the computer enters sleep status without a monitor
connected, the system switches the console to the virtual
terminal tty63(SUSPEND_CONSOLE).

If a monitor is subsequently connected before waking up,
the system skips the required VT restoration process
during wake-up, leaving the console on tty63 instead of
switching back to tty1.

To fix this issue, a global flag vt_switch_done is introduced
to record whether the system has successfully switched to
the suspend console via vt_move_to_console() during suspend.

If the switch was completed, vt_switch_done is set to 1.
Later during resume, this flag is checked to ensure that
the original console is restored properly by calling
vt_move_to_console(orig_fgconsole, 0).

This prevents scenarios where the resume logic skips console
restoration due to incorrect detection of the console state,
especially when a monitor is reconnected before waking up.

Signed-off-by: tuhaowen <tuhaowen@uniontech.com>
Link: https://patch.msgid.link/20250611032345.29962-1-tuhaowen@uniontech.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-07-03 15:58:33 +02:00
Thomas Gleixner
858e65af91 irqdomain: Add device pointer to irq_domain_info and msi_domain_info
Add device pointer to irq_domain_info and msi_domain_info, so that the device
can be specified at domain creation time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/943e52403b20cf13c320d55bd4446b4562466aab.1750860131.git.namcao@linutronix.de
2025-07-03 15:49:24 +02:00
Paolo Abeni
4f38a6db7b Merge tag 'ktime-get-clock-ts64-for-ptp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Base implementation for PTP with a temporary CLOCK_AUX* workaround to
allow integration of depending changes into the networking tree.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03 15:35:07 +02:00
Thomas Gleixner
a6d9638d4d Merge tag 'ktime-get-clock-ts64-for-ptp' into timers/ptp
Pull the base implementation of ktime_get_clock_ts64() for PTP, which
contains a temporary CLOCK_AUX* workaround. That was created to allow
integration of depending changes into the networking tree. The workaround
is going to be removed in a subsequent change in the timekeeping tree.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-07-03 14:35:53 +02:00
Thomas Gleixner
5b605dbee0 timekeeping: Provide ktime_get_clock_ts64()
PTP implements an inline switch case for taking timestamps from various
POSIX clock IDs, which already consumes quite some text space. Expanding it
for auxiliary clocks really becomes too big for inlining.

Provide a out of line version. 

The function invalidates the timestamp in case the clock is invalid. The
invalidation allows to implement a validation check without the need to
propagate a return value through deep existing call chains.

Due to merge logistics this temporarily defines CLOCK_AUX[_LAST] if
undefined, so that the plain branch, which does not contain any of the core
timekeeper changes, can be pulled into the networking tree as prerequisite
for the PTP side changes. These temporary defines are removed after that
branch is merged into the tip::timers/ptp branch. That way the result in
-next or upstream in the next merge window has zero dependencies.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250701132628.357686408@linutronix.de
2025-07-03 14:18:06 +02:00
Peter Zijlstra
ba677dbe77 perf: Revert to requiring CAP_SYS_ADMIN for uprobes
Jann reports that uprobes can be used destructively when used in the
middle of an instruction. The kernel only verifies there is a valid
instruction at the requested offset, but due to variable instruction
length cannot determine if this is an instruction as seen by the
intended execution stream.

Additionally, Mark Rutland notes that on architectures that mix data
in the text segment (like arm64), a similar things can be done if the
data word is 'mistaken' for an instruction.

As such, require CAP_SYS_ADMIN for uprobes.

Fixes: c9e0924e5c ("perf/core: open access to probes for CAP_PERFMON privileged process")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
2025-07-03 10:33:55 +02:00
Paul Chaignon
65fdafd676 bpf: Avoid warning on multiple referenced args in call
The description of full helper calls in syzkaller [1] and the addition of
kernel warnings in commit 0df1a55afa ("bpf: Warn on internal verifier
errors") allowed syzbot to reach a verifier state that was thought to
indicate a verifier bug [2]:

    12: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
    verifier bug: more than one arg with ref_obj_id R2 2 2

This error can be reproduced with the program from the previous commit:

    0: (b7) r2 = 20
    1: (b7) r3 = 0
    2: (18) r1 = 0xffff92cee3cbc600
    4: (85) call bpf_ringbuf_reserve#131
    5: (55) if r0 == 0x0 goto pc+3
    6: (bf) r1 = r0
    7: (bf) r2 = r0
    8: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
    9: (95) exit

bpf_tcp_raw_gen_syncookie_ipv4 expects R1 and R2 to be
ARG_PTR_TO_FIXED_SIZE_MEM (with a size of at least sizeof(struct iphdr)
for R1). R0 is a ring buffer payload of 20B and therefore matches this
requirement.

The verifier reaches the check on ref_obj_id while verifying R2 and
rejects the program because the helper isn't supposed to take two
referenced arguments.

This case is a legitimate rejection and doesn't indicate a kernel bug,
so we shouldn't log it as such and shouldn't emit a kernel warning.

Link: https://github.com/google/syzkaller/pull/4313 [1]
Link: https://lore.kernel.org/all/686491d6.a70a0220.3b7e22.20ea.GAE@google.com/T/ [2]
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Fixes: 0df1a55afa ("bpf: Warn on internal verifier errors")
Reported-by: syzbot+69014a227f8edad4d8c6@syzkaller.appspotmail.com
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/cd09afbfd7bef10bbc432d72693f78ffdc1e8ee5.1751463262.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2025-07-02 10:43:38 -07:00
Eduard Zingerman
c3b9faac9b bpf: avoid jump misprediction for PTR_TO_MEM | PTR_UNTRUSTED
Commit f2362a57ae ("bpf: allow void* cast using bpf_rdonly_cast()")
added a notion of PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED type.
This simultaneously introduced a bug in jump prediction logic for
situations like below:

  p = bpf_rdonly_cast(..., 0);
  if (p) a(); else b();

Here verifier would wrongly predict that else branch cannot be taken.
This happens because:
- Function reg_not_null() assumes that PTR_TO_MEM w/o PTR_MAYBE_NULL
  flag cannot be zero.
- The call to bpf_rdonly_cast() returns a rdonly_untrusted_mem value
  w/o PTR_MAYBE_NULL flag.

Tracking of PTR_MAYBE_NULL flag for untrusted PTR_TO_MEM does not make
sense, as the main purpose of the flag is to catch null pointer access
errors. Such errors are not possible on load of PTR_UNTRUSTED values
and verifier makes sure that PTR_UNTRUSTED can't be passed to helpers
or kfuncs.

Hence, modify reg_not_null() to assume that nullness of untrusted
PTR_TO_MEM is not known.

Fixes: f2362a57ae ("bpf: allow void* cast using bpf_rdonly_cast()")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250702073620.897517-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2025-07-02 10:43:10 -07:00
Yury Norov [NVIDIA]
e0e9506523 smp: Defer check for local execution in smp_call_function_many_cond()
Defer check for local execution to the actual place where it is needed,
which removes the extra local variable.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250623000010.10124-5-yury.norov@gmail.com
2025-07-02 19:13:14 +02:00
Song Liu
5bc9557c9f bpf: Mark cgroup_subsys_state->cgroup RCU safe
Mark struct cgroup_subsys_state->cgroup as safe under RCU read lock. This
will enable accessing css->cgroup from a bpf css iterator.

Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250623063854.1896364-4-song@kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02 14:18:20 +02:00
Song Liu
b95ee9049c bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node
BPF programs, such as LSM and sched_ext, would benefit from tags on
cgroups. One common practice to apply such tags is to set xattrs on
cgroupfs folders.

Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's
xattr.

Note that, we already have bpf_get_[file|dentry]_xattr. However, these
two APIs are not ideal for reading cgroupfs xattrs, because:

  1) These two APIs only works in sleepable contexts;
  2) There is no kfunc that matches current cgroup to cgroupfs dentry.

bpf_cgroup_read_xattr is generic and can be useful for many program
types. It is also safe, because it requires trusted or rcu protected
argument (KF_RCU). Therefore, we make it available to all program types.

Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02 14:18:20 +02:00
Masami Hiramatsu (Google)
2867495dea tracing: tprobe-events: Register tracepoint when enable tprobe event
As same as fprobe, register tracepoint stub function only when enabling
tprobe events. The major changes are introducing a list of
tracepoint_user and its lock, and tprobe_event_module_nb, which is
another module notifier for module loading/unloading.  By spliting the
lock from event_mutex and a module notifier for trace_fprobe, it
solved AB-BA lock dependency issue between event_mutex and
tracepoint_module_list_mutex.

Link: https://lore.kernel.org/all/174343538901.843280.423773753642677941.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:45:35 +09:00
Masami Hiramatsu (Google)
2db832ec90 tracing: fprobe-events: Register fprobe-events only when it is enabled
Currently fprobe events are registered when it is defined. Thus it will
give some overhead even if it is disabled. This changes it to register the
fprobe only when it is enabled.

Link: https://lore.kernel.org/all/174343537128.843280.16131300052837035043.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:45:09 +09:00
Masami Hiramatsu (Google)
e3d6e1b9a3 tracing: tprobe-events: Support multiple tprobes on the same tracepoint
Allow user to set multiple tracepoint-probe events on the same
tracepoint. After the last tprobe-event is removed, the tracepoint
callback is unregistered.

Link: https://lore.kernel.org/all/174343536245.843280.6548776576601537671.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:44:55 +09:00
Masami Hiramatsu (Google)
c135ab4a96 tracing: tprobe-events: Remove mod field from tprobe-event
Remove unneeded 'mod' struct module pointer field from trace_fprobe
because we don't need to save this info.

Link: https://lore.kernel.org/all/174343535351.843280.5868426549023332120.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:44:36 +09:00
Masami Hiramatsu (Google)
ddb017ec9c tracing: probe-events: Cleanup entry-arg storing code
Cleanup __store_entry_arg() so that it is easier to understand.
The main complexity may come from combining the loops for finding
stored-entry-arg and max-offset and appending new entry.

This split those different loops into 3 parts, lookup the same
entry-arg, find the max offset and append new entry.

Link: https://lore.kernel.org/all/174323039929.348535.4705349977127704120.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-07-02 09:44:12 +09:00
Paul Chaignon
f824274587 bpf: Reject %p% format string in bprintf-like helpers
static const char fmt[] = "%p%";
    bpf_trace_printk(fmt, sizeof(fmt));

The above BPF program isn't rejected and causes a kernel warning at
runtime:

    Please remove unsupported %\x00 in format string
    WARNING: CPU: 1 PID: 7244 at lib/vsprintf.c:2680 format_decode+0x49c/0x5d0

This happens because bpf_bprintf_prepare skips over the second %,
detected as punctuation, while processing %p. This patch fixes it by
not skipping over punctuation. %\x00 is then processed in the next
iteration and rejected.

Reported-by: syzbot+e2c932aec5c8a6e1d31c@syzkaller.appspotmail.com
Fixes: 48cac3f4a9 ("bpf: Implement formatted output helpers with bstr_printf")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/a0e06cc479faec9e802ae51ba5d66420523251ee.1751395489.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-01 15:22:46 -07:00
Paul Chaignon
0df1a55afa bpf: Warn on internal verifier errors
This patch is a follow up to commit 1cb0f56d96 ("bpf: WARN_ONCE on
verifier bugs"). It generalizes the use of verifier_error throughout
the verifier, in particular for logs previously marked "verifier
internal error". As a consequence, all of those verifier bugs will now
come with a kernel warning (under CONFIG_DBEUG_KERNEL) detectable by
fuzzers.

While at it, some error messages that were too generic (ex., "bpf
verifier is misconfigured") have been reworded.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/aGQqnzMyeagzgkCK@Tunnel
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-01 12:38:30 -07:00
Daniel Wagner
b6139a6abf lib/group_cpus: Let group_cpu_evenly() return the number of initialized masks
group_cpu_evenly() might have allocated less groups then requested:

group_cpu_evenly()
  __group_cpus_evenly()
    alloc_nodes_groups()
      # allocated total groups may be less than numgrps when
      # active total CPU number is less then numgrps

In this case, the caller will do an out of bound access because the
caller assumes the masks returned has numgrps.

Return the number of groups created so the caller can limit the access
range accordingly.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01 10:24:11 -06:00
Greg Kroah-Hartman
e78f70bad2 time/timecounter: Fix the lie that struct cyclecounter is const
In both the read callback for struct cyclecounter, and in struct
timecounter, struct cyclecounter is declared as a const pointer.

Unfortunatly, a number of users of this pointer treat it as a non-const
pointer as it is burried in a larger structure that is heavily modified by
the callback function when accessed.  This lie had been hidden by the fact
that container_of() "casts away" a const attribute of a pointer without any
compiler warning happening at all.

Fix this all up by removing the const attribute in the needed places so
that everyone can see that the structure really isn't const, but can,
and is, modified by the users of it.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/2025070124-backyard-hurt-783a@gregkh
2025-07-01 15:38:25 +02:00
Peter Zijlstra
009836b4fa sched/core: Fix migrate_swap() vs. hotplug
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:

> So, the potential race scenario is:
>
> 	CPU0							CPU1
> 	// doing migrate_swap(cpu0/cpu1)
> 	stop_two_cpus()
> 							  ...
> 							 // doing _cpu_down()
> 							      sched_cpu_deactivate()
> 								set_cpu_active(cpu, false);
> 								balance_push_set(cpu, true);
> 	cpu_stop_queue_two_works
> 	    __cpu_stop_queue_work(stopper1,...);
> 	    __cpu_stop_queue_work(stopper2,..);
> 	stop_cpus_in_progress -> true
> 		preempt_enable();
> 								...
> 							1st balance_push
> 							stop_one_cpu_nowait
> 							cpu_stop_queue_work
> 							__cpu_stop_queue_work
> 							list_add_tail  -> 1st add push_work
> 							wake_up_q(&wakeq);  -> "wakeq is empty.
> 										This implies that the stopper is at wakeq@migrate_swap."
> 	preempt_disable
> 	wake_up_q(&wakeq);
> 	        wake_up_process // wakeup migrate/0
> 		    try_to_wake_up
> 		        ttwu_queue
> 		            ttwu_queue_cond ->meet below case
> 		                if (cpu == smp_processor_id())
> 			         return false;
> 			ttwu_do_activate
> 			//migrate/0 wakeup done
> 		wake_up_process // wakeup migrate/1
> 	           try_to_wake_up
> 		    ttwu_queue
> 			ttwu_queue_cond
> 		        ttwu_queue_wakelist
> 			__ttwu_queue_wakelist
> 			__smp_call_single_queue
> 	preempt_enable();
>
> 							2nd balance_push
> 							stop_one_cpu_nowait
> 							cpu_stop_queue_work
> 							__cpu_stop_queue_work
> 							list_add_tail  -> 2nd add push_work, so the double list add is detected
> 							...
> 							...
> 							cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>

So this balance_push() is part of schedule(), and schedule() is supposed
to switch to stopper task, but because of this race condition, stopper
task is stuck in WAKING state and not actually visible to be picked.

Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.

This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.

Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.

Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.

Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.

Fixes: 2558aacff8 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
2025-07-01 15:02:03 +02:00
Thomas Weißschuh
3ebb1b6522 sched: Fix preemption string of preempt_dynamic_none
Zero is a valid value for "preempt_dynamic_mode", namely
"preempt_dynamic_none".

Fix the off-by-one in preempt_model_str(), so that "preempty_dynamic_none"
is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).

Fixes: 8bdc5daaa0 ("sched: Add a generic function to return the preemption string")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250626-preempt-str-none-v2-1-526213b70a89@linutronix.de
2025-07-01 15:02:02 +02:00
Thomas Gleixner
5173ac2dc8 Merge tag 'entry-split-for-arm' into core/entry
Prerequisite for ARM[64] generic entry conversion

Merge it into the entry branch so further changes can be based on it.
2025-06-30 19:54:19 +02:00
Jinjie Ruan
a70e9f647f entry: Split generic entry into generic exception and syscall entry
Currently CONFIG_GENERIC_ENTRY enables both the generic exception
entry logic and the generic syscall entry logic, which are otherwise
loosely coupled.

Introduce separate config options for these so that architectures can
select the two independently. This will make it easier for
architectures to migrate to generic entry code.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/20250213130007.1418890-2-ruanjinjie@huawei.com
Link: https://lore.kernel.org/all/20250624-generic-entry-split-v1-1-53d5ef4f94df@linaro.org

[Linus Walleij: rebase onto v6.16-rc1]
2025-06-30 19:52:55 +02:00
Luo Gengkun
7b4c5a3754 perf/core: Fix the WARN_ON_ONCE is out of lock protected region
commit 3172fb9866 ("perf/core: Fix WARN in perf_cgroup_switch()") try to
fix a concurrency problem between perf_cgroup_switch and
perf_cgroup_event_disable. But it does not to move the WARN_ON_ONCE into
lock-protected region, so the warning is still be triggered.

Fixes: 3172fb9866 ("perf/core: Fix WARN in perf_cgroup_switch()")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com
2025-06-30 09:32:49 +02:00
Linus Torvalds
2fc18d0b89 Merge tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:

 - Make sure an AUX perf event is really disabled when it overruns

* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix pending disable flow when the AUX ring buffer overruns
2025-06-29 08:16:02 -07:00
Linus Torvalds
ded779017a Merge tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:

 - Fix possible UAF on error path in filter_free_subsystem_filters()

   When freeing a subsystem filter, the filter for the subsystem is
   passed in to be freed and all the events within the subsystem will
   have their filter freed too. In order to free without waiting for RCU
   synchronization, list items are allocated to hold what is going to be
   freed to free it via a call_rcu(). If the allocation of these items
   fails, it will call the synchronization directly and free after that
   (causing a bit of delay for the user).

   The subsystem filter is first added to this list and then the filters
   for all the events under the subsystem. The bug is if one of the
   allocations of the list items for the event filters fail to allocate,
   it jumps to the "free_now" label which will free the subsystem
   filter, then all the items on the allocated list, and then the event
   filters that were not added to the list yet. But because the
   subsystem filter was added first, it gets freed twice.

   The solution is to add the subsystem filter after the events, and
   then if any of the allocations fail it will not try to free any of
   them twice

* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix filter logic error
2025-06-28 11:39:24 -07:00
Linus Torvalds
0fd39af24e Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "16 hotfixes.

  6 are cc:stable and the remainder address post-6.15 issues or aren't
  considered necessary for -stable kernels. 5 are for MM"

* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add Lorenzo as THP co-maintainer
  mailmap: update Duje Mihanović's email address
  selftests/mm: fix validate_addr() helper
  crashdump: add CONFIG_KEYS dependency
  mailmap: correct name for a historical account of Zijun Hu
  mailmap: add entries for Zijun Hu
  fuse: fix runtime warning on truncate_folio_batch_exceptionals()
  scripts/gdb: fix dentry_name() lookup
  mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
  mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
  lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
  mm/hugetlb: remove unnecessary holding of hugetlb_lock
  MAINTAINERS: add missing files to mm page alloc section
  MAINTAINERS: add tree entry to mm init block
  mm: add OOM killer maintainer structure
  fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
2025-06-27 20:34:10 -07:00
Edward Adam Davis
6921d1e07c tracing: Fix filter logic error
If the processing of the tr->events loop fails, the filter that has been
added to filter_head will be released twice in free_filter_list(&head->rcu)
and __free_filter(filter).

After adding the filter of tr->events, add the filter to the filter_head
process to avoid triggering uaf.

Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com
Fixes: a9d0aab5eb ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894
Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-27 15:51:36 -04:00
Eduard Zingerman
a5a7b25d75 bpf: guard BTF_ID_FLAGS(bpf_cgroup_read_xattr) with CONFIG_BPF_LSM
Function bpf_cgroup_read_xattr is defined in fs/bpf_fs_kfuncs.c,
which is compiled only when CONFIG_BPF_LSM is set. Add CONFIG_BPF_LSM
check to bpf_cgroup_read_xattr spec in common_btf_ids in
kernel/bpf/helpers.c to avoid build failures for configs w/o
CONFIG_BPF_LSM.

Build failure example:

    BTF     .tmp_vmlinux1.btf.o
  btf_encoder__tag_kfunc: failed to find kfunc 'bpf_cgroup_read_xattr' in BTF
  ...
  WARN: resolve_btfids: unresolved symbol bpf_cgroup_read_xattr
  make[2]: *** [scripts/Makefile.vmlinux:91: vmlinux.unstripped] Error 255

Fixes: 535b070f4a ("bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node")
Reported-by: Jake Hillion <jakehillion@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250627175309.2710973-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-27 12:18:42 -07:00
Thomas Gleixner
7b95663a3d timekeeping: Provide interface to control auxiliary clocks
Auxiliary clocks are disabled by default and attempts to access them
fail.

Provide an interface to enable/disable them at run-time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.444626478@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
e6d4c00719 timekeeping: Provide update for auxiliary timekeepers
Update the auxiliary timekeepers periodically. For now this is tied to the system
timekeeper update from the tick. This might be revisited and moved out of the tick.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.382451331@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
ecf3e70304 timekeeping: Provide adjtimex() for auxiliary clocks
The behaviour is close to clock_adtime(CLOCK_REALTIME) with the
following differences:

  1) ADJ_SETOFFSET adjusts the auxiliary clock offset
  
  2) ADJ_TAI is not supported

  3) Leap seconds are not supported

  4) PPS is not supported

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.317946543@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
4eca49d0b6 timekeeping: Prepare do_adtimex() for auxiliary clocks
Exclude ADJ_TAI, leap seconds and PPS functionality as they make no sense
in the context of auxiliary clocks and provide a time stamp based on the
actual clock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.253203783@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
775f71ebed timekeeping: Make do_adjtimex() reusable
Split out the actual functionality of adjtimex() and make do_adjtimex() a
wrapper which feeds the core timekeeper into it and handles the result
including audit at the call site.

This allows to reuse the actual functionality for auxiliary clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.187322876@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
2c8aea59c2 timekeeping: Add auxiliary clock support to __timekeeping_inject_offset()
Redirect the relative offset adjustment to the auxiliary clock offset
instead of modifying CLOCK_REALTIME, which has no meaning in context of
these clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.124057787@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
e8db3a5579 timekeeping: Make timekeeping_inject_offset() reusable
Split out the inner workings for auxiliary clock support and feed the core time
keeper into it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.059934561@linutronix.de
2025-06-27 20:13:13 +02:00
Thomas Gleixner
60ecc26ec5 timekeeping: Provide time setter for auxiliary clocks
Add clock_settime(2) support for auxiliary clocks. The function affects the
AUX offset which is added to the "monotonic" clock readout of these clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.995688714@linutronix.de
2025-06-27 20:13:12 +02:00
Thomas Gleixner
606424bf4f timekeeping: Add minimal posix-timers support for auxiliary clocks
Provide clock_getres(2) and clock_gettime(2) for auxiliary clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.932220594@linutronix.de
2025-06-27 20:13:12 +02:00
Thomas Gleixner
05bc6e6290 timekeeping: Provide time getters for auxiliary clocks
Provide interfaces similar to the ktime_get*() family which provide access
to the auxiliary clocks.

These interfaces have a boolean return value, which indicates whether the
accessed clock is valid or not.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.868342628@linutronix.de
2025-06-27 20:13:12 +02:00
Thomas Gleixner
9f7729480a timekeeping: Update auxiliary timekeepers on clocksource change
Propagate a system clocksource change to the auxiliary timekeepers so that
they can pick up the new clocksource.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183757.803890875@linutronix.de
2025-06-27 20:13:12 +02:00
Viktor Malik
5272b51367 bpf: Fix string kfuncs names in doc comments
Documentation comments for bpf_strnlen and bpf_strcspn contained
incorrect function names.

Fixes: e91370550f ("bpf: Add kfuncs for read-only string operations")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/bpf/20250627174759.3a435f86@canb.auug.org.au/T/#u
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20250627082001.237606-1-vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-27 08:48:06 -07:00
Alexei Starovoitov
48d998af99 Merge branch 'vfs-6.17.bpf' of https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Merge branch 'vfs-6.17.bpf' from vfs tree into bpf-next/master
and resolve conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-26 19:01:04 -07:00
Yury Norov [NVIDIA]
976e0e3103 smp: Use cpumask_any_but() in smp_call_function_many_cond()
smp_call_function_many_cond() opencodes cpumask_any_but().

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250623000010.10124-3-yury.norov@gmail.com
2025-06-26 23:46:34 +02:00
Yury Norov [NVIDIA]
5f295519b4 smp: Improve locality in smp_call_function_any()
smp_call_function_any() tries to make a local call as it's the cheapest
option, or switches to a CPU in the same node. If it's not possible, the
algorithm gives up and searches for any CPU, in a numerical order.

Instead, it can search for the best CPU based on NUMA locality, including
the 2nd nearest hop (a set of equidistant nodes), and higher.

sched_numa_find_nth_cpu() does exactly that, and also helps to drop most
of the housekeeping code.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250623000010.10124-2-yury.norov@gmail.com
2025-06-26 23:46:34 +02:00
Mario Limonciello
12ffc3b151 PM: Restrict swap use to later in the suspend sequence
Currently swap is restricted before drivers have had a chance to do
their prepare() PM callbacks. Restricting swap this early means that if
a driver needs to evict some content from memory into sawp in it's
prepare callback, it won't be able to.

On AMD dGPUs this can lead to failed suspends under memory pressure
situations as all VRAM must be evicted to system memory or swap.

Move the swap restriction to right after all devices have had a chance
to do the prepare() callback.  If there is any problem with the sequence,
restore swap in the appropriate dpm resume callbacks or error handling
paths.

Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Nat Wittstock <nat@fardog.io>
Tested-by: Lucian Langa <lucilanga@7pot.org>
Link: https://patch.msgid.link/20250613214413.4127087-1-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-06-26 20:39:34 +02:00
Alexei Starovoitov
886178a33a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3
Cross-merge BPF, perf and other fixes after downstream PRs.
It restores BPF CI to green after critical fix
commit bc4394e5e7 ("perf: Fix the throttle error of some clock events")

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-26 09:49:39 -07:00
Viktor Malik
e91370550f bpf: Add kfuncs for read-only string operations
String operations are commonly used so this exposes the most common ones
to BPF programs. For now, we limit ourselves to operations which do not
copy memory around.

Unfortunately, most in-kernel implementations assume that strings are
%NUL-terminated, which is not necessarily true, and therefore we cannot
use them directly in the BPF context. Instead, we open-code them using
__get_kernel_nofault instead of plain dereference to make them safe and
limit the strings length to XATTR_SIZE_MAX to make sure the functions
terminate. When __get_kernel_nofault fails, functions return -EFAULT.
Similarly, when the size bound is reached, the functions return -E2BIG.
In addition, we return -ERANGE when the passed strings are outside of
the kernel address space.

Note that thanks to these dynamic safety checks, no other constraints
are put on the kfunc args (they are marked with the "__ign" suffix to
skip any verifier checks for them).

All of the functions return integers, including functions which normally
(in kernel or libc) return pointers to the strings. The reason is that
since the strings are generally treated as unsafe, the pointers couldn't
be dereferenced anyways. So, instead, we return an index to the string
and let user decide what to do with it. This also nicely fits with
returning various error codes when necessary (see above).

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/4b008a6212852c1b056a413f86e3efddac73551c.1750917800.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-26 09:44:45 -07:00
Leo Yan
1476b21832 perf/aux: Fix pending disable flow when the AUX ring buffer overruns
If an AUX event overruns, the event core layer intends to disable the
event by setting the 'pending_disable' flag. Unfortunately, the event
is not actually disabled afterwards.

In commit:

  ca6c21327c ("perf: Fix missing SIGTRAPs")

the 'pending_disable' flag was changed to a boolean. However, the
AUX event code was not updated accordingly. The flag ends up holding a
CPU number. If this number is zero, the flag is taken as false and the
IRQ work is never triggered.

Later, with commit:

  2b84def990 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")

a new IRQ work 'pending_disable_irq' was introduced to handle event
disabling. The AUX event path was not updated to kick off the work queue.

To fix this bug, when an AUX ring buffer overrun is detected, call
perf_event_disable_inatomic() to initiate the pending disable flow.

Also update the outdated comment for setting the flag, to reflect the
boolean values (0 or 1).

Fixes: 2b84def990 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
Fixes: ca6c21327c ("perf: Fix missing SIGTRAPs")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-perf-users@vger.kernel.org
Link: https://lore.kernel.org/r/20250625170737.2918295-1-leo.yan@arm.com
2025-06-26 10:50:37 +02:00