Currently there is no straightforward way to fill dynptr memory with a
value (most commonly zero). One can do it with bpf_dynptr_write(), but
a temporary buffer is necessary for that.
Implement bpf_dynptr_memset() - an analogue of memset() from libc.
Signed-off-by: Ihor Solodrai <isolodrai@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250702210309.3115903-2-isolodrai@meta.com
When the computer enters sleep status without a monitor
connected, the system switches the console to the virtual
terminal tty63(SUSPEND_CONSOLE).
If a monitor is subsequently connected before waking up,
the system skips the required VT restoration process
during wake-up, leaving the console on tty63 instead of
switching back to tty1.
To fix this issue, a global flag vt_switch_done is introduced
to record whether the system has successfully switched to
the suspend console via vt_move_to_console() during suspend.
If the switch was completed, vt_switch_done is set to 1.
Later during resume, this flag is checked to ensure that
the original console is restored properly by calling
vt_move_to_console(orig_fgconsole, 0).
This prevents scenarios where the resume logic skips console
restoration due to incorrect detection of the console state,
especially when a monitor is reconnected before waking up.
Signed-off-by: tuhaowen <tuhaowen@uniontech.com>
Link: https://patch.msgid.link/20250611032345.29962-1-tuhaowen@uniontech.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Base implementation for PTP with a temporary CLOCK_AUX* workaround to
allow integration of depending changes into the networking tree.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull the base implementation of ktime_get_clock_ts64() for PTP, which
contains a temporary CLOCK_AUX* workaround. That was created to allow
integration of depending changes into the networking tree. The workaround
is going to be removed in a subsequent change in the timekeeping tree.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
PTP implements an inline switch case for taking timestamps from various
POSIX clock IDs, which already consumes quite some text space. Expanding it
for auxiliary clocks really becomes too big for inlining.
Provide a out of line version.
The function invalidates the timestamp in case the clock is invalid. The
invalidation allows to implement a validation check without the need to
propagate a return value through deep existing call chains.
Due to merge logistics this temporarily defines CLOCK_AUX[_LAST] if
undefined, so that the plain branch, which does not contain any of the core
timekeeper changes, can be pulled into the networking tree as prerequisite
for the PTP side changes. These temporary defines are removed after that
branch is merged into the tip::timers/ptp branch. That way the result in
-next or upstream in the next merge window has zero dependencies.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250701132628.357686408@linutronix.de
Jann reports that uprobes can be used destructively when used in the
middle of an instruction. The kernel only verifies there is a valid
instruction at the requested offset, but due to variable instruction
length cannot determine if this is an instruction as seen by the
intended execution stream.
Additionally, Mark Rutland notes that on architectures that mix data
in the text segment (like arm64), a similar things can be done if the
data word is 'mistaken' for an instruction.
As such, require CAP_SYS_ADMIN for uprobes.
Fixes: c9e0924e5c ("perf/core: open access to probes for CAP_PERFMON privileged process")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
The description of full helper calls in syzkaller [1] and the addition of
kernel warnings in commit 0df1a55afa ("bpf: Warn on internal verifier
errors") allowed syzbot to reach a verifier state that was thought to
indicate a verifier bug [2]:
12: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
verifier bug: more than one arg with ref_obj_id R2 2 2
This error can be reproduced with the program from the previous commit:
0: (b7) r2 = 20
1: (b7) r3 = 0
2: (18) r1 = 0xffff92cee3cbc600
4: (85) call bpf_ringbuf_reserve#131
5: (55) if r0 == 0x0 goto pc+3
6: (bf) r1 = r0
7: (bf) r2 = r0
8: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
9: (95) exit
bpf_tcp_raw_gen_syncookie_ipv4 expects R1 and R2 to be
ARG_PTR_TO_FIXED_SIZE_MEM (with a size of at least sizeof(struct iphdr)
for R1). R0 is a ring buffer payload of 20B and therefore matches this
requirement.
The verifier reaches the check on ref_obj_id while verifying R2 and
rejects the program because the helper isn't supposed to take two
referenced arguments.
This case is a legitimate rejection and doesn't indicate a kernel bug,
so we shouldn't log it as such and shouldn't emit a kernel warning.
Link: https://github.com/google/syzkaller/pull/4313 [1]
Link: https://lore.kernel.org/all/686491d6.a70a0220.3b7e22.20ea.GAE@google.com/T/ [2]
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Fixes: 0df1a55afa ("bpf: Warn on internal verifier errors")
Reported-by: syzbot+69014a227f8edad4d8c6@syzkaller.appspotmail.com
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/cd09afbfd7bef10bbc432d72693f78ffdc1e8ee5.1751463262.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Commit f2362a57ae ("bpf: allow void* cast using bpf_rdonly_cast()")
added a notion of PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED type.
This simultaneously introduced a bug in jump prediction logic for
situations like below:
p = bpf_rdonly_cast(..., 0);
if (p) a(); else b();
Here verifier would wrongly predict that else branch cannot be taken.
This happens because:
- Function reg_not_null() assumes that PTR_TO_MEM w/o PTR_MAYBE_NULL
flag cannot be zero.
- The call to bpf_rdonly_cast() returns a rdonly_untrusted_mem value
w/o PTR_MAYBE_NULL flag.
Tracking of PTR_MAYBE_NULL flag for untrusted PTR_TO_MEM does not make
sense, as the main purpose of the flag is to catch null pointer access
errors. Such errors are not possible on load of PTR_UNTRUSTED values
and verifier makes sure that PTR_UNTRUSTED can't be passed to helpers
or kfuncs.
Hence, modify reg_not_null() to assume that nullness of untrusted
PTR_TO_MEM is not known.
Fixes: f2362a57ae ("bpf: allow void* cast using bpf_rdonly_cast()")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250702073620.897517-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
BPF programs, such as LSM and sched_ext, would benefit from tags on
cgroups. One common practice to apply such tags is to set xattrs on
cgroupfs folders.
Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's
xattr.
Note that, we already have bpf_get_[file|dentry]_xattr. However, these
two APIs are not ideal for reading cgroupfs xattrs, because:
1) These two APIs only works in sleepable contexts;
2) There is no kfunc that matches current cgroup to cgroupfs dentry.
bpf_cgroup_read_xattr is generic and can be useful for many program
types. It is also safe, because it requires trusted or rcu protected
argument (KF_RCU). Therefore, we make it available to all program types.
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
As same as fprobe, register tracepoint stub function only when enabling
tprobe events. The major changes are introducing a list of
tracepoint_user and its lock, and tprobe_event_module_nb, which is
another module notifier for module loading/unloading. By spliting the
lock from event_mutex and a module notifier for trace_fprobe, it
solved AB-BA lock dependency issue between event_mutex and
tracepoint_module_list_mutex.
Link: https://lore.kernel.org/all/174343538901.843280.423773753642677941.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cleanup __store_entry_arg() so that it is easier to understand.
The main complexity may come from combining the loops for finding
stored-entry-arg and max-offset and appending new entry.
This split those different loops into 3 parts, lookup the same
entry-arg, find the max offset and append new entry.
Link: https://lore.kernel.org/all/174323039929.348535.4705349977127704120.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
This patch is a follow up to commit 1cb0f56d96 ("bpf: WARN_ONCE on
verifier bugs"). It generalizes the use of verifier_error throughout
the verifier, in particular for logs previously marked "verifier
internal error". As a consequence, all of those verifier bugs will now
come with a kernel warning (under CONFIG_DBEUG_KERNEL) detectable by
fuzzers.
While at it, some error messages that were too generic (ex., "bpf
verifier is misconfigured") have been reworded.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/aGQqnzMyeagzgkCK@Tunnel
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
group_cpu_evenly() might have allocated less groups then requested:
group_cpu_evenly()
__group_cpus_evenly()
alloc_nodes_groups()
# allocated total groups may be less than numgrps when
# active total CPU number is less then numgrps
In this case, the caller will do an out of bound access because the
caller assumes the masks returned has numgrps.
Return the number of groups created so the caller can limit the access
range accordingly.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In both the read callback for struct cyclecounter, and in struct
timecounter, struct cyclecounter is declared as a const pointer.
Unfortunatly, a number of users of this pointer treat it as a non-const
pointer as it is burried in a larger structure that is heavily modified by
the callback function when accessed. This lie had been hidden by the fact
that container_of() "casts away" a const attribute of a pointer without any
compiler warning happening at all.
Fix this all up by removing the const attribute in the needed places so
that everyone can see that the structure really isn't const, but can,
and is, modified by the users of it.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/2025070124-backyard-hurt-783a@gregkh
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:
> So, the potential race scenario is:
>
> CPU0 CPU1
> // doing migrate_swap(cpu0/cpu1)
> stop_two_cpus()
> ...
> // doing _cpu_down()
> sched_cpu_deactivate()
> set_cpu_active(cpu, false);
> balance_push_set(cpu, true);
> cpu_stop_queue_two_works
> __cpu_stop_queue_work(stopper1,...);
> __cpu_stop_queue_work(stopper2,..);
> stop_cpus_in_progress -> true
> preempt_enable();
> ...
> 1st balance_push
> stop_one_cpu_nowait
> cpu_stop_queue_work
> __cpu_stop_queue_work
> list_add_tail -> 1st add push_work
> wake_up_q(&wakeq); -> "wakeq is empty.
> This implies that the stopper is at wakeq@migrate_swap."
> preempt_disable
> wake_up_q(&wakeq);
> wake_up_process // wakeup migrate/0
> try_to_wake_up
> ttwu_queue
> ttwu_queue_cond ->meet below case
> if (cpu == smp_processor_id())
> return false;
> ttwu_do_activate
> //migrate/0 wakeup done
> wake_up_process // wakeup migrate/1
> try_to_wake_up
> ttwu_queue
> ttwu_queue_cond
> ttwu_queue_wakelist
> __ttwu_queue_wakelist
> __smp_call_single_queue
> preempt_enable();
>
> 2nd balance_push
> stop_one_cpu_nowait
> cpu_stop_queue_work
> __cpu_stop_queue_work
> list_add_tail -> 2nd add push_work, so the double list add is detected
> ...
> ...
> cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>
So this balance_push() is part of schedule(), and schedule() is supposed
to switch to stopper task, but because of this race condition, stopper
task is stuck in WAKING state and not actually visible to be picked.
Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.
This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.
Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.
Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.
Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.
Fixes: 2558aacff8 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
Pull perf fix from Borislav Petkov:
- Make sure an AUX perf event is really disabled when it overruns
* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Fix pending disable flow when the AUX ring buffer overruns
Pull tracing fix from Steven Rostedt:
- Fix possible UAF on error path in filter_free_subsystem_filters()
When freeing a subsystem filter, the filter for the subsystem is
passed in to be freed and all the events within the subsystem will
have their filter freed too. In order to free without waiting for RCU
synchronization, list items are allocated to hold what is going to be
freed to free it via a call_rcu(). If the allocation of these items
fails, it will call the synchronization directly and free after that
(causing a bit of delay for the user).
The subsystem filter is first added to this list and then the filters
for all the events under the subsystem. The bug is if one of the
allocations of the list items for the event filters fail to allocate,
it jumps to the "free_now" label which will free the subsystem
filter, then all the items on the allocated list, and then the event
filters that were not added to the list yet. But because the
subsystem filter was added first, it gets freed twice.
The solution is to add the subsystem filter after the events, and
then if any of the allocations fail it will not try to free any of
them twice
* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix filter logic error
Pull misc fixes from Andrew Morton:
"16 hotfixes.
6 are cc:stable and the remainder address post-6.15 issues or aren't
considered necessary for -stable kernels. 5 are for MM"
* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add Lorenzo as THP co-maintainer
mailmap: update Duje Mihanović's email address
selftests/mm: fix validate_addr() helper
crashdump: add CONFIG_KEYS dependency
mailmap: correct name for a historical account of Zijun Hu
mailmap: add entries for Zijun Hu
fuse: fix runtime warning on truncate_folio_batch_exceptionals()
scripts/gdb: fix dentry_name() lookup
mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
mm/hugetlb: remove unnecessary holding of hugetlb_lock
MAINTAINERS: add missing files to mm page alloc section
MAINTAINERS: add tree entry to mm init block
mm: add OOM killer maintainer structure
fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
Function bpf_cgroup_read_xattr is defined in fs/bpf_fs_kfuncs.c,
which is compiled only when CONFIG_BPF_LSM is set. Add CONFIG_BPF_LSM
check to bpf_cgroup_read_xattr spec in common_btf_ids in
kernel/bpf/helpers.c to avoid build failures for configs w/o
CONFIG_BPF_LSM.
Build failure example:
BTF .tmp_vmlinux1.btf.o
btf_encoder__tag_kfunc: failed to find kfunc 'bpf_cgroup_read_xattr' in BTF
...
WARN: resolve_btfids: unresolved symbol bpf_cgroup_read_xattr
make[2]: *** [scripts/Makefile.vmlinux:91: vmlinux.unstripped] Error 255
Fixes: 535b070f4a ("bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node")
Reported-by: Jake Hillion <jakehillion@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250627175309.2710973-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Split out the actual functionality of adjtimex() and make do_adjtimex() a
wrapper which feeds the core timekeeper into it and handles the result
including audit at the call site.
This allows to reuse the actual functionality for auxiliary clocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250625183758.187322876@linutronix.de
smp_call_function_any() tries to make a local call as it's the cheapest
option, or switches to a CPU in the same node. If it's not possible, the
algorithm gives up and searches for any CPU, in a numerical order.
Instead, it can search for the best CPU based on NUMA locality, including
the 2nd nearest hop (a set of equidistant nodes), and higher.
sched_numa_find_nth_cpu() does exactly that, and also helps to drop most
of the housekeeping code.
Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250623000010.10124-2-yury.norov@gmail.com
Cross-merge BPF, perf and other fixes after downstream PRs.
It restores BPF CI to green after critical fix
commit bc4394e5e7 ("perf: Fix the throttle error of some clock events")
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
String operations are commonly used so this exposes the most common ones
to BPF programs. For now, we limit ourselves to operations which do not
copy memory around.
Unfortunately, most in-kernel implementations assume that strings are
%NUL-terminated, which is not necessarily true, and therefore we cannot
use them directly in the BPF context. Instead, we open-code them using
__get_kernel_nofault instead of plain dereference to make them safe and
limit the strings length to XATTR_SIZE_MAX to make sure the functions
terminate. When __get_kernel_nofault fails, functions return -EFAULT.
Similarly, when the size bound is reached, the functions return -E2BIG.
In addition, we return -ERANGE when the passed strings are outside of
the kernel address space.
Note that thanks to these dynamic safety checks, no other constraints
are put on the kfunc args (they are marked with the "__ign" suffix to
skip any verifier checks for them).
All of the functions return integers, including functions which normally
(in kernel or libc) return pointers to the strings. The reason is that
since the strings are generally treated as unsafe, the pointers couldn't
be dereferenced anyways. So, instead, we return an index to the string
and let user decide what to do with it. This also nicely fits with
returning various error codes when necessary (see above).
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/4b008a6212852c1b056a413f86e3efddac73551c.1750917800.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If an AUX event overruns, the event core layer intends to disable the
event by setting the 'pending_disable' flag. Unfortunately, the event
is not actually disabled afterwards.
In commit:
ca6c21327c ("perf: Fix missing SIGTRAPs")
the 'pending_disable' flag was changed to a boolean. However, the
AUX event code was not updated accordingly. The flag ends up holding a
CPU number. If this number is zero, the flag is taken as false and the
IRQ work is never triggered.
Later, with commit:
2b84def990 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
a new IRQ work 'pending_disable_irq' was introduced to handle event
disabling. The AUX event path was not updated to kick off the work queue.
To fix this bug, when an AUX ring buffer overrun is detected, call
perf_event_disable_inatomic() to initiate the pending disable flow.
Also update the outdated comment for setting the flag, to reflect the
boolean values (0 or 1).
Fixes: 2b84def990 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
Fixes: ca6c21327c ("perf: Fix missing SIGTRAPs")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-perf-users@vger.kernel.org
Link: https://lore.kernel.org/r/20250625170737.2918295-1-leo.yan@arm.com