BPF stream kfuncs must be non-sleeping because they can be called from
programs running in any context. This requires a way to allocate memory
from any context. Currently, this is done by a custom per-CPU NMI-safe
bump allocation mechanism, backed by alloc_pages_nolock() and
free_pages_nolock() primitives.
As kmalloc_nolock() and kfree_nolock() primitives are available now, the
custom allocator can be removed in favor of these.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251023161448.4263-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Disable propagation and unwinding of the waiter queue in case the head
waiter detects a deadlock condition, but keep it enabled in case of the
timeout fallback.
Currently, when the head waiter experiences an AA deadlock, it will
signal all its successors in the queue to exit with an error. This is
not ideal for cases where the same lock is held in contexts which can
cause errors in an unrestricted fashion (e.g., BPF programs, or kernel
paths invoked through BPF programs), and core kernel logic which is
written in a correct fashion and does not expect deadlocks.
The same reasoning can be extended to ABBA situations. Depending on the
actual runtime schedule, one or both of the head waiters involved in an
ABBA situation can detect and exit directly without terminating their
waiter queue. If the ABBA situation manifests again, the waiters will
keep exiting until progress can be made, or a timeout is triggered in
case of more complicated locking dependencies.
We still preserve the queue destruction in case of timeouts, as either
the locking dependencies are too complex to be captured by AA and ABBA
heuristics, or the owner is perpetually stuck. As such, it would be
unwise to continue to apply the timeout for each new head waiter without
terminating the queue, since we may end up waiting for more than 250 ms
in aggregate with all participants in the locking transaction.
The patch itself is fairly simple; we can simply signal our successor to
become the next head waiter, and leave the queue without attempting to
acquire the lock.
With this change, the behavior for waiters in case of deadlocks
experienced by a predecessor changes. It is guaranteed that call sites
will no longer receive errors if the predecessors encounter deadlocks
and the successors do not participate in one. This should lower the
failure rate for waiters that are not doing improper locking operations
but were merely unlucky enough to queue behind a misbehaving waiter.
However, timeouts are still a possibility and must be accounted for, so
users cannot rely on errors not occurring at all.
Suggested-by: Amery Hung <ameryhung@gmail.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251029181828.231529-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extract the duplicated maximum allowed depth computation for stack
traces stored in BPF stacks from bpf_get_stackid() and __bpf_get_stack()
into a dedicated stack_map_calculate_max_depth() helper function.
This unifies the logic for:
- The max depth computation
- Enforcing the sysctl_perf_event_max_stack limit
No functional changes for existing code paths.
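As a rough user-space sketch of the computation being unified (not the kernel code itself): the depth is the number of entries that fit in the buffer plus the frames to skip, clamped to the sysctl limit. The names calc_max_depth() and SKIP_FIELD_MASK, and passing the limit as a parameter, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the flag bits that encode the skip count. */
#define SKIP_FIELD_MASK 0xffULL

/*
 * Entries that fit in the buffer plus the frames that will be skipped,
 * clamped to the system-wide limit passed in by the caller.
 */
static uint32_t calc_max_depth(uint32_t size, uint32_t elem_size,
			       uint64_t flags, uint32_t sysctl_max_stack)
{
	uint32_t skip = (uint32_t)(flags & SKIP_FIELD_MASK);
	uint32_t max_depth;

	assert(elem_size != 0);
	max_depth = size / elem_size + skip;

	return max_depth > sysctl_max_stack ? sysctl_max_stack : max_depth;
}
```

Keeping the clamp inside the helper means both callers enforce the limit the same way.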
Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20251025192858.31424-1-contact@arnaud-lcm.com
File dynptr reads may sleep when the requested folios are not in
the page cache. To avoid sleeping in non-sleepable contexts while still
supporting valid sleepable use, given that dynptrs are non-sleepable by
default, enable sleeping only when bpf_dynptr_from_file() is invoked
from a sleepable context.
This change:
* Introduces a sleepable constructor: bpf_dynptr_from_file_sleepable()
* Overrides the non-sleepable constructor with the sleepable one if it is
always called in a sleepable context
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251026203853.135105-10-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move kfunc specialization (function address substitution) to a later stage
of verification to support a new use case, where we need to take into
consideration whether the kfunc is called in a sleepable context.
Minor refactoring in add_kfunc_call(), making sure that if the function
fails, the kfunc desc is not added to tab->descs (previously it could be
added or not, depending on what failed).
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251026203853.135105-9-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for file dynptr.
Introduce struct bpf_dynptr_file_impl to hold internal state for file
dynptrs, with 64-bit size and offset support.
Introduce lifecycle management kfuncs:
- bpf_dynptr_from_file() for initialization
- bpf_dynptr_file_discard() for destruction
Extend existing helpers to support file dynptrs in:
- bpf_dynptr_read()
- bpf_dynptr_slice()
Write helpers (bpf_dynptr_write() and bpf_dynptr_data()) are not
modified, as file dynptr is read-only.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251026203853.135105-8-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the necessary verifier plumbing for the new file-backed dynptr type.
Introduce two kfuncs for its lifecycle management:
* bpf_dynptr_from_file() for initialization
* bpf_dynptr_file_discard() for destruction
Currently there is no mechanism for a kfunc to release a dynptr; this patch
adds one:
* Dynptr release function sets meta->release_regno
* Call unmark_stack_slots_dynptr() if meta->release_regno is set and
dynptr ref_obj_id is set as well.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251026203853.135105-7-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dynptr currently caps size and offset at 24 bits, which isn’t sufficient
for file-backed use cases; even 32 bits can be limiting. Refactor dynptr
helpers/kfuncs to use 64-bit size and offset, ensuring consistency
across the APIs.
This change does not affect internals of xdp, skb or other dynptrs,
which continue to behave as before. Also it does not break binary
compatibility.
The widening enables large-file access support via dynptr, implemented
in the next patches.
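To make the size limits concrete, here is a small user-space sketch, not the kernel implementation. It computes the largest value representable in a given number of bits and performs an overflow-safe range check with 64-bit size and offset. The helper names max_for_bits() and range_ok() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest value that fits in the given number of bits (1..64). */
static uint64_t max_for_bits(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return bits == 64 ? UINT64_MAX : (1ULL << bits) - 1;
}

/*
 * Overflow-safe check that [off, off + len) lies within an object of
 * the given size. Avoids computing off + len, which could wrap.
 */
static bool range_ok(uint64_t size, uint64_t off, uint64_t len)
{
	return off <= size && len <= size - off;
}
```

A 24-bit field tops out just under 16 MiB, which is why offsets into large files need the wider type.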
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251026203853.135105-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The bpf_insn_successors() function is used to return the successors
of a BPF instruction. So far, an instruction could have 0, 1 or 2
successors. Prepare the verifier code for the introduction of instructions
with more than 2 successors (namely, indirect jumps).
To do this, introduce a new struct, struct bpf_iarray, containing
an array of BPF instruction indexes, and make bpf_insn_successors()
return a pointer of that type. The storage is allocated once in
env->succ, which holds an array of size 2 and is reused for all
instructions.
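The idea can be sketched in user space as a counted array with a flexible array member that one scratch instance fills in per instruction. This is only an illustration: struct iarray, enum insn_kind and insn_successors() are hypothetical names, and the instruction classification is simplified.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Counted array of instruction indexes (illustrative layout). */
struct iarray {
	int cnt;
	uint32_t items[];
};

/* Allocate a zeroed array with room for 'capacity' indexes. */
static struct iarray *iarray_alloc(int capacity)
{
	struct iarray *arr;

	assert(capacity > 0);
	arr = calloc(1, sizeof(*arr) + (size_t)capacity * sizeof(arr->items[0]));
	return arr;
}

/* Simplified instruction classes, enough to show 0, 1 and 2 successors. */
enum insn_kind { INSN_EXIT, INSN_FALLTHROUGH, INSN_JA, INSN_COND_JMP };

/* Fill the shared scratch array with the successors of instruction idx. */
static struct iarray *insn_successors(struct iarray *succ, enum insn_kind kind,
				      uint32_t idx, int32_t off)
{
	uint32_t target = (uint32_t)((int64_t)idx + 1 + off);

	succ->cnt = 0;
	switch (kind) {
	case INSN_EXIT:
		break;
	case INSN_FALLTHROUGH:
		succ->items[succ->cnt++] = idx + 1;
		break;
	case INSN_JA:
		succ->items[succ->cnt++] = target;
		break;
	case INSN_COND_JMP:
		succ->items[succ->cnt++] = idx + 1;
		succ->items[succ->cnt++] = target;
		break;
	}
	return succ;
}
```

Because callers only see a count and an items array, an instruction with more than two successors can later return a larger array without changing the call sites.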
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251019202145.3944697-10-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The kernel/bpf/array.c file defines the array_map_get_next_key()
function which finds the next key for array maps. It actually doesn't
use any map fields besides the generic max_entries field. Generalize
it, and export as bpf_array_get_next_key() such that it can be
re-used by other array-like maps.
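A minimal sketch of such a next-key lookup that depends only on max_entries, written as user-space C under the assumption that keys are 32-bit indexes. The function name array_next_key() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Iterate array indexes: a missing or out-of-range key restarts at 0,
 * the last index reports -ENOENT, anything else advances by one.
 */
static int array_next_key(uint32_t max_entries, const uint32_t *key,
			  uint32_t *next_key)
{
	uint32_t index = key ? *key : UINT32_MAX;

	assert(next_key != NULL);

	if (index >= max_entries) {
		*next_key = 0;
		return 0;
	}
	if (index == max_entries - 1)
		return -ENOENT;

	*next_key = index + 1;
	return 0;
}
```

Since nothing here touches map-specific state, any array-like map with a max_entries field could share it.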
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251019202145.3944697-4-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce a new subprog_start field in bpf_prog_aux. This field may
be used by JIT compilers wanting to know the real absolute xlated
offset of the function being jitted. The func_info[func_id] may have
served this purpose, but func_info may be NULL, so JIT compilers
can't rely on it.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251019202145.3944697-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
propagate_to_outer_instance() calls get_outer_instance() and uses the
returned pointer to reset and commit stack write marks. Under normal
conditions, update_instance() guarantees that an outer instance exists,
so get_outer_instance() cannot return an ERR_PTR.
However, explicitly checking for IS_ERR(outer_instance) makes this code
more robust and self-documenting. It reduces cognitive load when reading
the control flow and silences potential false-positive reports from
static analysis or automated tooling.
No functional change intended.
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251021080849.860072-1-shardulsb08@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
we can mark it as trusted_or_null. With this change, BPF helpers can safely
access vma->vm_mm to retrieve the associated mm_struct from the VMA.
We can then make policy decisions based on the VMA.
The "trusted" annotation enables direct access to vma->vm_mm within kfuncs
marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and
bpf_task_under_cgroup(). Conversely, "null" enforcement requires all
callsites using vma->vm_mm to perform NULL checks.
The lsm selftest must be modified because it directly accesses vma->vm_mm
without a NULL pointer check; otherwise it will break due to this
change.
For the VMA based THP policy, the use case is as follows:
    @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
    if (!@mm)
        return;
    bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
    @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
    if (!@owner)
        goto out;
    @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
    /* make the decision based on the @cgroup1 attribute */
    bpf_cgroup_release(@cgroup1); // release the associated cgroup
  out:
    bpf_rcu_read_unlock();
PSI memory information can be obtained from the associated cgroup to inform
policy decisions. Since upstream PSI support is currently limited to cgroup
v2, the following example demonstrates cgroup v2 implementation:
    @owner = @mm->owner;
    if (@owner) {
        // @ancestor_cgid is user-configured
        @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
        if (bpf_task_under_cgroup(@owner, @ancestor)) {
            @psi_group = @ancestor->psi;
            /* Extract PSI metrics from @psi_group and
             * implement policy logic based on the values
             */
        }
    }
The vma::vm_file can also be marked with __safe_trusted_or_null.
No additional selftests are required since vma->vm_file and vma->vm_mm are
already validated in the existing selftest suite.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/r/20251016063929.13830-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm. We can then make
policy decision based on the task attribute.
The typical use case is as follows:
    bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
    @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
    if (!@owner)
        goto out;
    /* Do something based on the task attribute */
  out:
    bpf_rcu_read_unlock();
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20251016063929.13830-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When __lookup_instance() allocates a func_instance structure but fails
to allocate the must_write_set array, it returns an error without freeing
the previously allocated func_instance. This causes a memory leak of 192
bytes (sizeof(struct func_instance)) each time this error path is triggered.
Fix by freeing 'result' on must_write_set allocation failure.
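The shape of the fix can be sketched in user space as follows. The allocation wrappers with a failure-injection counter exist only to make the error path observable. All names (struct instance, tracked_calloc(), fail_after, live_allocs) are hypothetical and not taken from the kernel.

```c
#include <assert.h>
#include <stdlib.h>

struct instance {
	size_t nslots;
	unsigned char *must_write_set;
};

static long live_allocs;     /* allocations not yet freed */
static long fail_after = -1; /* <0: never fail; N: fail after N successes */

static void *tracked_calloc(size_t n, size_t sz)
{
	void *p;

	if (fail_after == 0)
		return NULL;
	if (fail_after > 0)
		fail_after--;
	p = calloc(n, sz);
	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

static struct instance *instance_create(size_t nslots)
{
	struct instance *result;

	assert(nslots > 0);
	result = tracked_calloc(1, sizeof(*result));
	if (!result)
		return NULL;

	result->must_write_set = tracked_calloc(nslots, 1);
	if (!result->must_write_set) {
		/* Without this free, 'result' leaks on the error path. */
		tracked_free(result);
		return NULL;
	}
	result->nslots = nslots;
	return result;
}

static void instance_destroy(struct instance *inst)
{
	if (!inst)
		return;
	tracked_free(inst->must_write_set);
	tracked_free(inst);
}
```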
Fixes: b3698c356a ("bpf: callchain sensitive stack liveness tracking using CFG")
Reported-by: BPF Runtime Fuzzer (BRF)
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
The following kmemleak splat:
[ 8.105530] kmemleak: Trying to color unknown object at 0xff11000100e918c0 as Black
[ 8.106521] Call Trace:
[ 8.106521] <TASK>
[ 8.106521] dump_stack_lvl+0x4b/0x70
[ 8.106521] kvfree_call_rcu+0xcb/0x3b0
[ 8.106521] ? hrtimer_cancel+0x21/0x40
[ 8.106521] bpf_obj_free_fields+0x193/0x200
[ 8.106521] htab_map_update_elem+0x29c/0x410
[ 8.106521] bpf_prog_cfc8cd0f42c04044_overwrite_cb+0x47/0x4b
[ 8.106521] bpf_prog_8c30cd7c4db2e963_overwrite_timer+0x65/0x86
[ 8.106521] bpf_prog_test_run_syscall+0xe1/0x2a0
happens due to the combination of features and fixes, but mainly due to
commit 6d78b4473c ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
It's using __GFP_HIGH, which instructs slub/kmemleak internals to skip
kmemleak_alloc_recursive() on allocation, so subsequent kfree_rcu()->
kvfree_call_rcu()->kmemleak_ignore() complains with the above splat.
To fix this imbalance, replace bpf_map_kmalloc_node() with
kmalloc_nolock() and kfree_rcu() with call_rcu() + kfree_nolock() to
make sure that the objects allocated with kmalloc_nolock() are freed
with kfree_nolock() rather than the implicit kfree() that kfree_rcu()
uses internally.
Note, the kmalloc_nolock() happens under bpf_spin_lock_irqsave(), so
it will always fail in PREEMPT_RT. This is not an issue at the moment,
since bpf_timers are disabled in PREEMPT_RT. In the future
bpf_spin_lock will be replaced with state machine similar to
bpf_task_work.
Fixes: 6d78b4473c ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/bpf/20251015000700.28988-1-alexei.starovoitov@gmail.com
Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw and trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.
Unfortunately, it appears that our test suite has several tests that
test writes to the trace_marker file, but lacks any tests for the
trace_marker_raw file (this needs to be remedied). Two issues that
syzbot found came from the update to the trace_marker_raw file:
- Fix tracing_mark_raw_write() to use per CPU buffer
The fix to use the per CPU buffer to copy from user space was
needed for both the trace_marker and trace_marker_raw files.
The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.
- Stop the fortify string warning from writing into trace_marker_raw
After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to trace_marker_raw expect
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:
    struct {
        struct trace_entry ent;
        int id;
        char buf[];
    };
The size of this structure is reserved on the ring buffer with:
    size = sizeof(*entry) + cnt;
Then it is copied from the buffer into the ring buffer with:
    memcpy(&entry->id, buf, cnt);
This used to be a copy_from_user_nofault(), but converting it to
a memcpy() triggers the fortify-string code and causes a warning.
The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.
Change the size function to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
And update the memcpy() to unsafe_memcpy()"
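The 4 byte difference can be checked with a small user-space sketch. The header struct entry_hdr is a hypothetical stand-in for struct trace_entry, and new_size() spells out what struct_size(entry, buf, cnt - sizeof(entry->id)) evaluates to for a char array, ignoring the overflow checking that the kernel macro adds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the common event header. */
struct entry_hdr {
	unsigned short type;
	unsigned char flags;
	unsigned char preempt_count;
	int pid;
};

struct raw_entry {
	struct entry_hdr ent;
	int id;
	char buf[];
};

/* Old reservation: counts the id field twice, since cnt already covers it. */
static size_t old_size(size_t cnt)
{
	return sizeof(struct raw_entry) + cnt;
}

/* New reservation: header plus id, then only the bytes that follow the id. */
static size_t new_size(size_t cnt)
{
	assert(cnt >= sizeof(int));
	return offsetof(struct raw_entry, buf) + (cnt - sizeof(int));
}
```

With cnt bytes written starting at the id field, the new size ends exactly where the copied data ends.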
* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
Pull more updates from Andrew Morton:
"Just one series here - Mike Rappoport has taught KEXEC handover to
preserve vmalloc allocations across handover"
* tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt
kho: add support for preserving vmalloc allocations
kho: replace kho_preserve_phys() with kho_preserve_pages()
kho: check if kho is finalized in __kho_preserve_order()
MAINTAINERS, .mailmap: update Umang's email address
The fix to use a per CPU buffer to read user space tested only the writes
to trace_marker. But it appears that the selftests are missing tests for
the trace_marker_raw file. The trace_marker_raw file is used by applications
that write data structures and not strings into the file, and the tools
read the raw ring buffer to process the structures they write.
The fix that reads into the per CPU buffers passes the new per CPU buffer to
the trace_marker file writes, but the update to the trace_marker_raw write
read the data from user space into the per CPU buffer and then still
passed the user space address to the function that records the data.
Pass in the per CPU buffer and not the user space address.
TODO: Add a test to better test trace_marker_raw.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011035243.386098147@kernel.org
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The arraymap and hashtab duplicate the logic that checks for and frees
internal structs (timer, workqueue, task_work) based on
BTF record flags. Centralize this by introducing two helpers:
* bpf_map_has_internal_structs(map)
Returns true if the map value contains any of the internal structs:
BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK.
* bpf_map_free_internal_structs(map, obj)
Frees the internal structs for a single value object.
Convert arraymap and both the prealloc/malloc hashtab paths to use the
new generic functions. This keeps the functionality for when/how to free
these special fields in one place and makes it easier to add support for
new internal structs in the future without touching every map
implementation.
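A simplified user-space sketch of the predicate, assuming the record exposes a bitmask of field types. The enum values, struct layouts and the name map_has_internal_structs() are hypothetical and only mirror the idea of one shared mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative field-type bits; the values are made up for this sketch. */
enum field_type {
	FIELD_TIMER     = 1u << 0,
	FIELD_WORKQUEUE = 1u << 1,
	FIELD_TASK_WORK = 1u << 2,
	FIELD_SPIN_LOCK = 1u << 3,
};

/* The single place that lists which field types count as internal structs. */
#define INTERNAL_STRUCTS (FIELD_TIMER | FIELD_WORKQUEUE | FIELD_TASK_WORK)

struct record {
	uint32_t field_mask;
};

struct map {
	const struct record *record; /* NULL when the value has no special fields */
};

static bool map_has_internal_structs(const struct map *map)
{
	assert(map != NULL);
	return map->record &&
	       (map->record->field_mask & INTERNAL_STRUCTS) != 0;
}
```

Adding a new internal struct type then means extending the one mask rather than every map implementation.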
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251010164606.147298-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When unpinning a BPF hash table (htab or htab_lru) that contains internal
structures (timer, workqueue, or task_work) in its values, a BUG warning
is triggered:
BUG: sleeping function called from invalid context at kernel/bpf/hashtab.c:244
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 14, name: ksoftirqd/0
...
The issue arises from the interaction between BPF object unpinning and
RCU callback mechanisms:
1. BPF object unpinning uses ->free_inode() which schedules cleanup via
call_rcu(), deferring the actual freeing to an RCU callback that
executes within the RCU_SOFTIRQ context.
2. During cleanup of hash tables containing internal structures,
htab_map_free_internal_structs() is invoked, which includes
cond_resched() or cond_resched_rcu() calls to yield the CPU during
potentially long operations.
However, cond_resched() or cond_resched_rcu() cannot be safely called from
atomic RCU softirq context, leading to the BUG warning when attempting
to reschedule.
Fix this by changing from ->free_inode() to ->destroy_inode() and rename
bpf_free_inode() to bpf_destroy_inode() for BPF objects (prog, map, link).
This allows direct inode freeing without RCU callback scheduling,
avoiding the invalid context warning.
Reported-by: Le Chen <tom2cat@sjtu.edu.cn>
Closes: https://lore.kernel.org/all/1444123482.1827743.1750996347470.JavaMail.zimbra@sjtu.edu.cn/
Fixes: 68134668c1 ("bpf: Add map side support for bpf timers.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251008102628.808045-2-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Rename the storage_get_func_atomic flag to a more generic non_sleepable
flag that tracks whether a helper or kfunc may be called from a
non-sleepable context. This makes the flag more broadly applicable
beyond just storage_get helpers. See [0] for more context.
The flag is now set unconditionally for all helpers and kfuncs when:
- RCU critical section is active.
- Preemption is disabled.
- IRQs are disabled.
- In a non-sleepable context within a sleepable program (e.g., timer
callbacks), which is indicated by !in_sleepable().
Previously, the flag was only set for storage_get helpers in these
contexts. With this change, it can be used by any code that needs to
differentiate between sleepable and non-sleepable contexts at the
per-instruction level.
The existing usage in do_misc_fixups() for storage_get helpers is
preserved by checking is_storage_get_function() before using the flag.
[0]: https://lore.kernel.org/bpf/CAP01T76cbaNi4p-y8E0sjE2NXSra2S=Uja8G4hSQDu_SbXxREQ@mail.gmail.com
Cc: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Link: https://lore.kernel.org/r/20251007220349.3852807-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix the BPF verifier to correctly determine the sleepable context of
async callbacks based on the async primitive type rather than the arming
program's context.
The bug is in in_sleepable() which uses OR logic to check if the current
execution context is sleepable. When a sleepable program arms a timer
callback, the callback's state correctly has in_sleepable=false, but
in_sleepable() would still return true due to env->prog->sleepable being
true. This incorrectly allows sleepable helpers like
bpf_copy_from_user() inside timer callbacks when armed from sleepable
programs, even though timer callbacks always execute in non-sleepable
context.
Fix in_sleepable() to rely solely on env->cur_state->in_sleepable, and
initialize state->in_sleepable to env->prog->sleepable in
do_check_common() for the main program entry. This ensures the sleepable
context is properly tracked per verification state rather than being
overridden by the program's sleepability.
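The difference between the two checks can be illustrated with a small user-space model. The struct and function names below (struct venv, struct vstate, in_sleepable_old(), in_sleepable_new()) are hypothetical simplifications of the verifier state, not the actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vstate {
	bool in_sleepable;
};

struct venv {
	bool prog_sleepable;             /* sleepability of the whole program */
	const struct vstate *cur_state;  /* state currently being verified */
};

/* OR logic: a sleepable program makes every state look sleepable. */
static bool in_sleepable_old(const struct venv *env)
{
	return env->prog_sleepable ||
	       (env->cur_state && env->cur_state->in_sleepable);
}

/* Per-state tracking: only the current state decides. */
static bool in_sleepable_new(const struct venv *env)
{
	assert(env->cur_state != NULL);
	return env->cur_state->in_sleepable;
}
```

For a timer callback state armed from a sleepable program, the first variant reports a sleepable context and the second does not.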
The env->cur_state NULL check in in_sleepable() was only needed for
do_misc_fixups() which runs after verification when env->cur_state is
set to NULL. Update do_misc_fixups() to use env->prog->sleepable
directly for the storage_get_function check, and remove the redundant
NULL check from in_sleepable().
Introduce is_async_cb_sleepable() helper to explicitly determine async
callback sleepability based on the primitive type:
- bpf_timer callbacks are never sleepable
- bpf_wq and bpf_task_work callbacks are always sleepable
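The mapping above could be sketched as a switch over the primitive type that rejects anything it does not know about. This is a user-space illustration. The enum async_kind, the function async_cb_sleepable() and the use of -EINVAL for the unhandled case are assumptions, standing in for the verifier_bug() check mentioned below.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum async_kind { ASYNC_TIMER, ASYNC_WQ, ASYNC_TASK_WORK };

/*
 * Decide callback sleepability from the primitive type alone.
 * Returns 0 and sets *sleepable, or -EINVAL for an unhandled type.
 */
static int async_cb_sleepable(int kind, bool *sleepable)
{
	assert(sleepable != NULL);

	switch (kind) {
	case ASYNC_TIMER:
		*sleepable = false;
		return 0;
	case ASYNC_WQ:
	case ASYNC_TASK_WORK:
		*sleepable = true;
		return 0;
	default:
		return -EINVAL;
	}
}
```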
Add verifier_bug() check to catch unhandled async callback types,
ensuring future additions cannot be silently mishandled. Move the
is_task_work_add_kfunc() forward declaration to the top alongside other
callback-related helpers. We update push_async_cb() to adjust to the new
changes.
At the same time, while simplifying in_sleepable(), we notice a problem
in do_misc_fixups. Fix storage_get helpers to use GFP_ATOMIC when called
from non-sleepable contexts within sleepable programs, such as bpf_timer
callbacks.
Currently, the check in do_misc_fixups() assumes that env->prog->sleepable
(previously in_sleepable(env), which resolved to only this check before
the last commit) holds across the program's execution, but that is not true.
Instead, the func_atomic bit must be set whenever we see the function
being called in an atomic context. Previously, this was done only when
the helper was invoked in atomic contexts in sleepable programs; we can
simply set the value to true without doing an in_sleepable() check.
We must also do a standalone in_sleepable() check to handle cases where
the async callback itself is armed from a sleepable program, but is
itself non-sleepable (e.g., timer callback) and invokes such a helper,
thus needing the func_atomic bit to be true for the said call.
Adjust do_misc_fixups() to drop any checks regarding sleepable nature of
the program, and just depend on the func_atomic bit to decide which GFP
flag to pass.
Fixes: 81f1d7a583 ("bpf: wq: add bpf_wq_set_callback_impl")
Fixes: b00fa38a9c ("bpf: Enable non-atomic allocations in local storage")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251007220349.3852807-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull tracing clean up and fixes from Steven Rostedt:
- Have osnoise tracer use memdup_user_nul()
The function osnoise_cpus_write() open codes a kmalloc() and then a
copy_from_user() and then adds a nul byte at the end which is the
same as simply using memdup_user_nul().
- Fix wakeup and irq tracers when failing to acquire calltime
When the wakeup and irq tracers use the function graph tracer for
tracing function times, it saves a timestamp into the fgraph shadow
stack. It is possible that this could fail to be stored. If that
happens, it exits the routine early. These functions also disable
nesting of the operations by incrementing the data "disable" counter.
But if the calltime exits out early, it never increments the counter
back to what it needs to be.
Since only a couple of lines of code do work after
acquiring the calltime, instead of exiting out early, reverse the if
statement to be true if calltime is acquired, and place the code that
is to be done within that if block. The clean up will always be done
after that.
- Fix ring_buffer_map() return value on failure of __rb_map_vma()
If __rb_map_vma() fails in ring_buffer_map(), it does not return an
error. This means the caller will be working against a bad vma
mapping. Have ring_buffer_map() return an error when __rb_map_vma()
fails.
- Fix regression of writing to the trace_marker file
A bug fix was made to change __copy_from_user_inatomic() to
copy_from_user_nofault() in the trace_marker write function. The
trace_marker file is used by applications to write into it (usually
with a file descriptor opened at the start of the program) to record
into the tracing system. It's usually used in critical sections so
the write to trace_marker is highly optimized.
The reason for copying in an atomic section is that the write
reserves space on the ring buffer and then writes directly into it.
After it writes, it commits the event. The time between reserve and
commit must have preemption disabled.
The trace marker write does not have any locking nor can it allocate
due to the nature of it being a critical path.
Unfortunately, converting __copy_from_user_inatomic() to
copy_from_user_nofault() caused a regression in Android. Now all the
writes from its applications trigger the fault that is rejected by
the _nofault() version that wasn't rejected by the _inatomic()
version. Instead of getting data, it now just gets a trace buffer
filled with:
tracing_mark_write: <faulted>
To fix this, on opening of the trace_marker file, allocate per CPU
buffers that can be used by the write call. Then when entering the
write call, do the following:
    preempt_disable();
    cpu = smp_processor_id();
    buffer = per_cpu_ptr(cpu_buffers, cpu);
    do {
        cnt = nr_context_switches_cpu(cpu);
        migrate_disable();
        preempt_enable();
        ret = copy_from_user(buffer, ptr, size);
        preempt_disable();
        migrate_enable();
    } while (!ret && cnt != nr_context_switches_cpu(cpu));
    if (!ret)
        ring_buffer_write(buffer);
    preempt_enable();
This works similarly to seqcount. As it must enable preemption to do
a copy_from_user() into a per CPU buffer, if it gets preempted, the
buffer could be corrupted by another task.
To handle this, read the number of context switches of the current
CPU, disable migration, enable preemption, copy the data from user
space, then immediately disable preemption again. If the number of
context switches is the same, the buffer is still valid. Otherwise it
must be assumed that the buffer may have been corrupted and it needs
to try again.
Now the trace_marker write can get the user data even if it has to
fault it in, and still not grab any locks of its own.
* tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Have trace_marker use per-cpu data to read user space
ring buffer: Propagate __rb_map_vma return value to caller
tracing: Fix irqoff tracers on failure of acquiring calltime
tracing: Fix wakeup tracers on failure of acquiring calltime
tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul
It was reported that using __copy_from_user_inatomic() can actually
schedule, which is bad when preemption is disabled. There is logic to
check whether in_atomic() is set, but it is a nop when the kernel is
configured with PREEMPT_NONE. Because of page faulting, the code
could schedule with preemption disabled.
Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/
The solution was to change the __copy_from_user_inatomic() to
copy_from_user_nofault(). But then it was reported that this caused a
regression in Android. Several applications write into trace_marker
in Android, but now instead of showing the expected data,
it is showing:
tracing_mark_write: <faulted>
After reverting the conversion to copy_from_user_nofault(), Android was
able to get the data again.
Writing to trace_marker is a way to efficiently and quickly enter data
into the Linux tracing buffer. It takes no locks and was designed to be as
non-intrusive as possible. This means it cannot allocate memory, and must
use pre-allocated data.
A method that is actively being worked on to have faultable system call
tracepoints read user space data is to allocate per CPU buffers, and use
them in the callback. The method uses a technique similar to seqcount.
That is something like this:
	preempt_disable();
	cpu = smp_processor_id();
	buffer = this_cpu_ptr(&pre_allocated_cpu_buffers);
	do {
		cnt = nr_context_switches_cpu(cpu);
		migrate_disable();
		preempt_enable();
		ret = copy_from_user(buffer, ptr, size);
		preempt_disable();
		migrate_enable();
	} while (!ret && cnt != nr_context_switches_cpu(cpu));
	if (!ret)
		ring_buffer_write(buffer);
	preempt_enable();
It's a little more involved than that, but the above is the basic logic.
The idea is to acquire the current CPU buffer, disable migration, and then
enable preemption. At this moment, it can safely use copy_from_user().
After reading the data from user space, it disables preemption again. It
then checks to see if there was any new scheduling on this CPU. If there
was, it must assume that the buffer was corrupted by another task. If
there wasn't, then the buffer is still valid as only tasks in preemptable
context can write to this buffer and only those that are running on the
CPU.
By using this method, where trace_marker open allocates the per CPU
buffers, trace_marker writes can access user space and even fault it in,
without having to allocate or take any locks of its own.
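As a rough userspace illustration of the retry logic above (not the
kernel implementation), the sketch below replaces
nr_context_switches_cpu() and copy_from_user() with hypothetical
stand-ins, fake_nr_switches and fake_copy(), and simulates a preemption
by bumping the counter and scribbling on the buffer. All names are
invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-CPU context switch counter. */
static unsigned long fake_nr_switches;

/* Hypothetical knob: how many upcoming copies get "preempted". */
static int fake_preemptions_left;

/* Stand-in for copy_from_user(): returns the number of bytes NOT copied. */
static size_t fake_copy(char *dst, const char *src, size_t size)
{
	memcpy(dst, src, size);
	if (fake_preemptions_left > 0) {
		/* Another task ran on this CPU and reused the buffer. */
		fake_preemptions_left--;
		fake_nr_switches++;
		memset(dst, '?', size);
	}
	return 0;
}

/*
 * Copy into the shared buffer, retrying while the switch counter moved
 * during the copy. Returns the number of attempts, or -1 on a failed copy.
 */
static int read_with_retry(char *buf, const char *src, size_t size)
{
	unsigned long cnt;
	size_t ret;
	int attempts = 0;

	assert(buf && src);
	do {
		cnt = fake_nr_switches;          /* snapshot before the copy */
		ret = fake_copy(buf, src, size); /* the preemptible window */
		attempts++;
	} while (!ret && cnt != fake_nr_switches);

	return ret ? -1 : attempts;
}
```

With two simulated preemptions, the loop needs three attempts before the
counter is stable across a copy, and only then is the buffer trusted.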
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Wattson CI <wattson-external@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home
Fixes: 3d62ab32df ("tracing: Fix tracing_marker may trigger page fault during preempt_disable")
Reported-by: Runping Lai <runpinglai@google.com>
Tested-by: Runping Lai <runpinglai@google.com>
Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A vmalloc allocation is preserved using a binary structure similar to the
global KHO memory tracker. It's a linked list of pages where each page is
an array of physical addresses of pages in the vmalloc area.
kho_preserve_vmalloc() hands out the physical address of the head page to
the caller. This address is used as the argument to kho_vmalloc_restore()
to restore the mapping in the vmalloc address space and populate it with
the preserved pages.
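A minimal userspace sketch of such a chunked list follows, assuming a
4096-byte page, a single "next" link per chunk, and a zero entry as the
end marker inside a chunk. The sketch_* names and the exact layout are
assumptions made for illustration and are not taken from the KHO code;
heap pointers stand in for physical addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PER_CHUNK ((SKETCH_PAGE_SIZE - sizeof(uint64_t)) / sizeof(uint64_t))

/* One page-sized chunk: a link plus as many page addresses as fit. */
struct sketch_chunk {
	uint64_t next;                   /* address of next chunk, 0 ends the list */
	uint64_t phys[SKETCH_PER_CHUNK]; /* preserved page addresses, 0 = unused */
};

_Static_assert(sizeof(struct sketch_chunk) == SKETCH_PAGE_SIZE,
	       "a chunk must fill exactly one page");

static struct sketch_chunk *sketch_ptr(uint64_t addr)
{
	return (struct sketch_chunk *)(uintptr_t)addr;
}

static void sketch_free(uint64_t head)
{
	while (head) {
		struct sketch_chunk *c = sketch_ptr(head);

		head = c->next;
		free(c);
	}
}

/* Record n page addresses; returns the head chunk address, or 0 on failure. */
static uint64_t sketch_preserve(const uint64_t *pages, size_t n)
{
	struct sketch_chunk *tail = NULL;
	uint64_t head = 0;
	size_t used = SKETCH_PER_CHUNK; /* forces a chunk for the first entry */

	for (size_t i = 0; i < n; i++) {
		assert(pages[i] != 0); /* 0 is reserved as the end marker */
		if (used == SKETCH_PER_CHUNK) {
			struct sketch_chunk *c = calloc(1, sizeof(*c));

			if (!c) {
				sketch_free(head);
				return 0;
			}
			if (tail)
				tail->next = (uint64_t)(uintptr_t)c;
			else
				head = (uint64_t)(uintptr_t)c;
			tail = c;
			used = 0;
		}
		tail->phys[used++] = pages[i];
	}
	return head;
}

/* Walk the list from head; copies up to cap entries, returns the total. */
static size_t sketch_restore(uint64_t head, uint64_t *out, size_t cap)
{
	size_t n = 0;

	for (struct sketch_chunk *c = sketch_ptr(head); c; c = sketch_ptr(c->next))
		for (size_t i = 0; i < SKETCH_PER_CHUNK && c->phys[i]; i++) {
			if (n < cap)
				out[n] = c->phys[i];
			n++;
		}
	return n;
}
```

Only the head address has to be handed to the caller, since every other
chunk is reachable through the next links.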
[pasha.tatashin@soleen.com: free chunks using free_page() not kfree()]
Link: https://lkml.kernel.org/r/mafs0a52idbeg.fsf@kernel.org
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20250921054458.4043761-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kho: add support for preserving vmalloc allocations", v5.
Following the discussion about preservation of memfd with LUO [1] these
patches add support for preserving vmalloc allocations.
Any KHO use case presumes that there's a data structure that lists
physical addresses of preserved folios (and potentially some additional
metadata). Allowing vmalloc preservations with KHO allows scalable
preservation of such data structures.
For instance, instead of allocating an array describing preserved folios in
the fdt, memfd preservation can use vmalloc:
	preserved_folios = vmalloc_array(nr_folios, sizeof(*preserved_folios));
	memfd_luo_preserve_folios(preserved_folios, folios, nr_folios);
	kho_preserve_vmalloc(preserved_folios, &folios_info);
This patch (of 4):
Instead of checking if kho is finalized in each caller of
__kho_preserve_order(), do it in the core function itself.
Link: https://lkml.kernel.org/r/20250921054458.4043761-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20250921054458.4043761-2-rppt@kernel.org
Link: https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com [1]
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull hyperv updates from Wei Liu:
- Unify guest entry code for KVM and MSHV (Sean Christopherson)
- Switch Hyper-V MSI domain to use msi_create_parent_irq_domain()
(Nam Cao)
- Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV
(Mukesh Rathor)
- Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)
- Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna
Kumar T S M)
- Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari,
Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)
* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hyperv: Remove the spurious null directive line
MAINTAINERS: Mark hyperv_fb driver Obsolete
fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
Drivers: hv: Make CONFIG_HYPERV bool
Drivers: hv: Add CONFIG_HYPERV_VMBUS option
Drivers: hv: vmbus: Fix typos in vmbus_drv.c
Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
x86/hyperv: Switch to msi_create_parent_irq_domain()
mshv: Use common "entry virt" APIs to do work in root before running guest
entry: Rename "kvm" entry code assets to "virt" to genericize APIs
entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
mshv: Handle NEED_RESCHED_LAZY before transferring to guest
x86/hyperv: Add kexec/kdump support on Azure CVMs
Drivers: hv: Simplify data structures for VMBus channel close message
Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
mshv: Add support for a new parent partition configuration
clocksource: hyper-v: Skip unnecessary checks for the root partition
hyperv: Add missing field to hv_output_map_device_interrupt
Pull tracing updates from Steven Rostedt:
- Use READ_ONCE() and WRITE_ONCE() instead of RCU for syscall
tracepoints
Individual system call trace events are pseudo events attached to the
raw_syscall trace events that just trace the entry and exit of all
system calls. When any of these individual system call trace events
get enabled, an element in an array indexed by the system call number
is assigned to the trace file that defines how to trace it. When the
trace event triggers, it reads this array and if the array has an
element, it uses that trace file to know what to write (the trace
file defines the output format of the corresponding system call).
The issue is that it uses rcu_dereference_ptr() and marks the
elements of the array as using RCU. This is incorrect. There is no
RCU synchronization here. The event file that is pointed to has a
completely different way to make sure it is freed properly. The reading
of the array during the system call trace event is only to know if
there is a value or not. If not, it does nothing (it means this
system call isn't being traced). If it does, it uses the information
to store the system call data.
The RCU usage here can simply be replaced by READ_ONCE() and
WRITE_ONCE() macros.
- Have the system call trace events use "0x" for hex values
Some system call trace events display hex values without a leading
"0x". A reader seeing "count: 44" would assume 44 decimal when it is
actually 44 hex (68 decimal). Display "0x44" instead.
- Use vmalloc_array() in tracing_map_sort_entries()
The function tracing_map_sort_entries() used array_size() and
vmalloc() when it could have simply used vmalloc_array().
- Use for_each_online_cpu() in trace_osnoise.c
Instead of open coding for_each_cpu(cpu, cpu_online_mask), use
for_each_online_cpu().
- Move the buffer field in struct trace_seq to the end
The buffer field in struct trace_seq is architecture dependent in
size, and caused padding for the fields after it. By moving the
buffer to the end of the structure, it compacts the trace_seq
structure better.
- Remove redundant zeroing of cmdline_idx field in
saved_cmdlines_buffer()
The structure that contains cmdline_idx is zeroed by memset(), no
need to explicitly zero any of its fields after that.
- Use system_percpu_wq instead of system_wq in user_event_mm_remove()
As system_wq is being deprecated, use the new wq.
- Add cond_resched() in ftrace_module_enable()
Some modules have a lot of functions (thousands of them), and the
enabling of those functions can take some time. On non-preemptible
kernels, it was triggering a watchdog timeout. Add a cond_resched()
to prevent that.
- Add a BUILD_BUG_ON() to make sure PID_MAX_DEFAULT is always a power
of 2
There's code that depends on PID_MAX_DEFAULT being a power of 2 or it
will break. If that changes in the future, make the build fail so
that the code depending on it gets fixed.
- Grab mutex_lock() before ever exiting s_start()
The s_start() function is a seq_file start routine. s_stop() is
always called even if s_start() fails, and s_stop() expects the
event_mutex to be held as it will always release it. That mutex must
therefore always be taken in s_start(), even if that function fails.
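The READ_ONCE()/WRITE_ONCE() item above can be sketched in userspace as
follows. The two macros here are simplified stand-ins built on a
volatile cast (they rely on the GCC/Clang __typeof__ extension and are
not the kernel definitions), and the sketch_* names and the array size
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel macros. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

#define SKETCH_NR_SYSCALLS 8

/* Hypothetical stand-in for the trace file describing the output format. */
struct sketch_trace_file {
	const char *fmt;
};

/* One slot per system call number; NULL means "not traced". */
static struct sketch_trace_file *sketch_enabled[SKETCH_NR_SYSCALLS];

static void sketch_enable(int nr, struct sketch_trace_file *file)
{
	assert(nr >= 0 && nr < SKETCH_NR_SYSCALLS);
	WRITE_ONCE(sketch_enabled[nr], file);
}

static void sketch_disable(int nr)
{
	assert(nr >= 0 && nr < SKETCH_NR_SYSCALLS);
	WRITE_ONCE(sketch_enabled[nr], NULL);
}

/* Returns the format to use for this system call, or NULL if untraced. */
static const char *sketch_on_syscall(int nr)
{
	struct sketch_trace_file *file;

	if (nr < 0 || nr >= SKETCH_NR_SYSCALLS)
		return NULL;
	file = READ_ONCE(sketch_enabled[nr]); /* single load, then test it */
	return file ? file->fmt : NULL;
}
```

The reader loads the slot once and only tests whether it is set; the
lifetime of the object it points to is assumed to be handled elsewhere,
as the description above states.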
* tag 'trace-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix lock imbalance in s_start() memory allocation failure path
tracing: Ensure optimized hashing works
ftrace: Fix softlockup in ftrace_module_enable
tracing: replace use of system_wq with system_percpu_wq
tracing: Remove redundant 0 value initialization
tracing: Move buffer in trace_seq to end of struct
tracing/osnoise: Use for_each_online_cpu() instead of for_each_cpu()
tracing: Use vmalloc_array() to improve code
tracing: Have syscall trace events show "0x" for values greater than 10
tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
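To make the "0x" change from this pull concrete, the following
userspace sketch formats the same value with and without the prefix.
The helper names and the "count" field label are hypothetical and only
mirror the example given in the description.

```c
#include <assert.h>
#include <stdio.h>

/* Old style: bare hex digits, easily misread as decimal. */
static int sketch_fmt_bare(char *buf, size_t len, unsigned long v)
{
	assert(buf && len);
	return snprintf(buf, len, "count: %lx", v);
}

/* New style: explicit "0x" prefix marks the value as hex. */
static int sketch_fmt_prefixed(char *buf, size_t len, unsigned long v)
{
	assert(buf && len);
	return snprintf(buf, len, "count: 0x%lx", v);
}
```

For the value 68, the first helper produces "count: 44" and the second
produces "count: 0x44".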
Pull probe fix from Masami Hiramatsu:
- Fix race condition in kprobe initialization causing NULL pointer
dereference. This happens on weak memory models, because the flags
access was not ordered with appropriate memory barriers.
Use RELEASE-ACQUIRE to fix it.
* tag 'probes-fixes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix race condition in kprobe initialization causing NULL pointer dereference
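The RELEASE-ACQUIRE pairing mentioned above can be illustrated with C11
atomics in userspace. This is a generic sketch of the pattern, not the
kprobe fix itself: the struct, the flag bit and the function names are
all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_FLAG_READY 0x1u

/* Hypothetical object: data must be visible before the flag is seen. */
struct sketch_probe {
	int *data;
	atomic_uint flags;
};

static void sketch_init(struct sketch_probe *p)
{
	p->data = NULL;
	atomic_init(&p->flags, 0);
}

/* Writer: set up data first, then publish the flag with RELEASE. */
static void sketch_publish(struct sketch_probe *p, int *data)
{
	assert(data);
	p->data = data;
	atomic_fetch_or_explicit(&p->flags, SKETCH_FLAG_READY,
				 memory_order_release);
}

/* Reader: ACQUIRE load of the flag orders the later read of data. */
static int *sketch_consume(struct sketch_probe *p)
{
	unsigned int flags;

	flags = atomic_load_explicit(&p->flags, memory_order_acquire);
	if (!(flags & SKETCH_FLAG_READY))
		return NULL; /* not published yet, do not touch data */
	return p->data;
}
```

A reader that observes the flag through the acquire load is guaranteed
to also observe the data pointer stored before the release, which is the
ordering a plain flag test does not provide on weakly ordered CPUs.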
Pull crypto updates from Herbert Xu:
"Drivers:
- Add ciphertext hiding support to ccp
- Add hashjoin, gather and UDMA data move features to hisilicon
- Add lz4 and lz77_only to hisilicon
- Add xilinx hwrng driver
- Add ti driver with ecb/cbc aes support
- Add ring buffer idle and command queue telemetry for GEN6 in qat
Others:
- Use rcu_dereference_all to stop false alarms in rhashtable
- Fix CPU number wraparound in padata"
* tag 'v6.18-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (78 commits)
dt-bindings: rng: hisi-rng: convert to DT schema
crypto: doc - Add explicit title heading to API docs
hwrng: ks-sa - fix division by zero in ks_sa_rng_init
KEYS: X.509: Fix Basic Constraints CA flag parsing
crypto: anubis - simplify return statement in anubis_mod_init
crypto: hisilicon/qm - set NULL to qm->debug.qm_diff_regs
crypto: hisilicon/qm - clear all VF configurations in the hardware
crypto: hisilicon - enable error reporting again
crypto: hisilicon/qm - mask axi error before memory init
crypto: hisilicon/qm - invalidate queues in use
crypto: qat - Return pointer directly in adf_ctl_alloc_resources
crypto: aspeed - Fix dma_unmap_sg() direction
rhashtable: Use rcu_dereference_all and rcu_dereference_all_check
crypto: comp - Use same definition of context alloc and free ops
crypto: omap - convert from tasklet to BH workqueue
crypto: qat - Replace kzalloc() + copy_from_user() with memdup_user()
crypto: caam - double the entropy delay interval for retry
padata: WQ_PERCPU added to alloc_workqueue users
padata: replace use of system_unbound_wq with system_dfl_wq
crypto: cryptd - WQ_PERCPU added to alloc_workqueue users
...
Pull RCU updates from Paul McKenney:
"Documentation updates:
- Update whatisRCU.rst and checklist.rst for recent RCU API additions
- Fix RCU documentation formatting and typos
- Replace dead Ottawa Linux Symposium links in RTFP.txt
Miscellaneous RCU updates:
- Document that rcu_barrier() hurries RCU_LAZY callbacks
- Remove redundant interrupt disabling from
rcu_preempt_deferred_qs_handler()
- Move list_for_each_rcu from list.h to rculist.h, and adjust the
include directive in kernel/cgroup/dmem.c accordingly
- Make initial set of changes to accommodate upcoming
system_percpu_wq changes
SRCU updates:
- Create an srcu_read_lock_fast_notrace() for eventual use in
tracing, including adding guards
- Document the reliance on per-CPU operations as implicit RCU readers
in __srcu_read_{,un}lock_fast()
- Document the srcu_flip() function's memory-barrier D's relationship
to SRCU-fast readers
- Remove a redundant preempt_disable() and preempt_enable() pair from
srcu_gp_start_if_needed()
Torture-test updates:
- Fix jitter.sh spin time so that it actually varies as advertised.
It is still quite coarse-grained, but at least it does now vary
- Update torture.sh help text to include the not-so-new --do-normal
parameter, which permits (for example) testing KCSAN kernels
without doing non-debug kernels
- Fix a number of false-positive diagnostics that were being
triggered by rcutorture starting before boot completed. Running
multiple near-CPU-bound rcutorture processes when there is only the
boot CPU is after all a bit excessive
- Substitute kcalloc() for kzalloc()
- Remove a redundant kfree() and NULL out kfree()ed objects"
* tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits)
rcu: WQ_UNBOUND added to sync_wq workqueue
rcu: WQ_PERCPU added to alloc_workqueue users
rcu: replace use of system_wq with system_percpu_wq
refperf: Set reader_tasks to NULL after kfree()
refperf: Remove redundant kfree() after torture_stop_kthread()
srcu/tiny: Remove preempt_disable/enable() in srcu_gp_start_if_needed()
srcu: Document srcu_flip() memory-barrier D relation to SRCU-fast
srcu: Document __srcu_read_{,un}lock_fast() implicit RCU readers
rculist: move list_for_each_rcu() to where it belongs
refscale: Use kcalloc() instead of kzalloc()
rcutorture: Use kcalloc() instead of kzalloc()
docs: rcu: Replace multiple dead OLS links in RTFP.txt
doc: Fix typo in RCU's torture.rst documentation
Documentation: RCU: Retitle toctree index
Documentation: RCU: Reduce toctree depth
Documentation: RCU: Wrap kvm-remote.sh rerun snippet in literal code block
rcu: docs: Requirements.rst: Abide by conventions of kernel documentation
doc: Add RCU guards to checklist.rst
doc: Update whatisRCU.rst for recent RCU API additions
rcutorture: Delay forward-progress testing until boot completes
...
Pull printk updates from Petr Mladek:
- Add KUnit test for the printk ring buffer
- Fix the check of the maximal record size which is allowed to be
stored into the printk ring buffer. It prevents corruptions of the
ring buffer.
Note that printk() is on the safe side. The messages are limited by a
1kB buffer and are always small enough for the minimal log buffer
size of 4kB, see the CONFIG_LOG_BUF_SHIFT definition.
* tag 'printk-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: ringbuffer: Fix data block max size check
printk: kunit: support offstack cpumask
printk: kunit: Fix __counted_by() in struct prbtest_rbdata
printk: ringbuffer: Explain why the KUnit test ignores failed writes
printk: ringbuffer: Add KUnit test