Warn in the case it is called with cpu == -1. This does not appear
to happen anywhere.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The hrtimer function callback must not be NULL. It has to be specified by
the call side but it is not validated by the hrtimer code. When a hrtimer
is queued without a function callback, the kernel crashes with a null
pointer dereference when trying to execute the callback in __run_hrtimer().
Introduce a validation before queuing the hrtimer in
hrtimer_start_range_ns().
[anna-maria: Rephrase commit message]
Signed-off-by: Phil Chang <phil.chang@mediatek.com>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Using sys_io_pgetevents() as the entry point for compat mode tasks
works almost correctly, but misses the sign extension for the min_nr
and nr arguments.
This was addressed on parisc by switching to
compat_sys_io_pgetevents_time64() in commit 6431e92fc8 ("parisc:
io_pgetevents_time64() needs compat syscall in 32-bit compat mode"),
as well as by using more sophisticated system call wrappers on x86 and
s390. However, arm64, mips, powerpc, sparc and riscv still have the
same bug.
Change all of them over to use compat_sys_io_pgetevents_time64()
like parisc already does. This was clearly the intention when the
function was originally added, but it got hooked up incorrectly in
the tables.
Cc: stable@vger.kernel.org
Fixes: 48166e6ea4 ("y2038: add 64-bit time_t syscalls to all 32-bit architectures")
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
If hung_task_panic is enabled, don't consider the value of
hung_task_warnings and display the information of the hung tasks.
In some cases, hung_task_panic might not be initially set up, after
several hung tasks occur, the hung_task_warnings count reaches zero. If
hung_task_panic is set up later, it may not display any helpful hung task
info in dmesg, only showing messages like:
Kernel panic - not syncing: hung_task: blocked tasks
CPU: 3 PID: 58 Comm: khungtaskd Not tainted 6.10.0-rc3 #19
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Call Trace:
<TASK>
panic+0x2f3/0x320
watchdog+0x2dd/0x510
? __pfx_watchdog+0x10/0x10
kthread+0xe0/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Link: https://lkml.kernel.org/r/20240613033159.3446265-1-leonylgao@gmail.com
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Huang Cun <cunhuang@tencent.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: John Siddle <jsiddle@redhat.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Daniel Borkmann says:
====================
pull-request: bpf 2024-06-24
We've added 12 non-merge commits during the last 10 day(s) which contain
a total of 10 files changed, 412 insertions(+), 16 deletions(-).
The main changes are:
1) Fix a BPF verifier issue validating may_goto with a negative offset,
from Alexei Starovoitov.
2) Fix a BPF verifier validation bug with may_goto combined with jump to
the first instruction, also from Alexei Starovoitov.
3) Fix a bug with overrunning reservations in BPF ring buffer,
from Daniel Borkmann.
4) Fix a bug in BPF verifier due to missing proper var_off setting related
to movsx instruction, from Yonghong Song.
5) Silence unnecessary syzkaller-triggered warning in __xdp_reg_mem_model(),
from Daniil Dulov.
* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xdp: Remove WARN() from __xdp_reg_mem_model()
selftests/bpf: Add tests for may_goto with negative offset.
bpf: Fix may_goto with negative offset.
selftests/bpf: Add more ring buffer test coverage
bpf: Fix overrunning reservations in ringbuf
selftests/bpf: Tests with may_goto and jumps to the 1st insn
bpf: Fix the corner case with may_goto and jump to the 1st insn.
bpf: Update BPF LSM maintainer list
bpf: Fix remap of arena.
selftests/bpf: Add a few tests to cover
bpf: Add missed var_off setting in coerce_subreg_to_size_sx()
bpf: Add missed var_off setting in set_sext32_default_val()
====================
Link: https://patch.msgid.link/20240624124330.8401-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When kernel has pending uretprobes installed, it hijacks original user
function return address on the stack with a uretprobe trampoline
address. There could be multiple such pending uretprobes (either on
different user functions or on the same recursive one) at any given
time within the same task.
This approach interferes with the user stack trace capture logic, which
would report suprising addresses (like 0x7fffffffe000) that correspond
to a special "[uprobes]" section that kernel installs in the target
process address space for uretprobe trampoline code, while logically it
should be an address somewhere within the calling function of another
traced user function.
This is easy to correct for, though. Uprobes subsystem keeps track of
pending uretprobes and records original return addresses. This patch is
using this to do a post-processing step and restore each trampoline
address entries with correct original return address. This is done only
if there are pending uretprobes for current task.
This is a similar approach to what fprobe/kretprobe infrastructure is
doing when capturing kernel stack traces in the presence of pending
return probes.
Link: https://lore.kernel.org/all/20240522013845.1631305-3-andrii@kernel.org/
Reported-by: Riham Selim <rihams@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
packet and makes decisions. While doing that, the per-CPU variable
bpf_redirect_info is used.
- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
and it may also access other per-CPU variables like xskmap_flush_list.
At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.
The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protections reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.
PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.
Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it, bpf_net_ctx_clear() removes it again.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
NET_TX_SOFTIRQ which does not. Therefore only the first invocations
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info. The returned data structure is zero initialized to
ensure nothing is leaked from stack. This is done on first usage of the
struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
note that initialisation is required. First invocation of
bpf_net_ctx_get_ri() will memset() the data structure and update
bpf_redirect_info::kern_flags.
bpf_redirect_info::nh is excluded from memset because it is only used
once BPF_F_NEIGH is set which also sets the nh member. The kern_flags is
moved past nh to exclude it from memset.
The pointer to bpf_net_context is saved task's task_struct. Using
always the bpf_net_context approach has the advantage that there is
almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add local_lock_nested_bh() locking. It is based on local_lock_t and the
naming follows the preempt_disable_nested() example.
For !PREEMPT_RT + !LOCKDEP it is a per-CPU annotation for locking
assumptions based on local_bh_disable(). The macro is optimized away
during compilation.
For !PREEMPT_RT + LOCKDEP the local_lock_nested_bh() is reduced to
the usual lock-acquire plus lockdep_assert_in_softirq() - ensuring that
BH is disabled.
For PREEMPT_RT local_lock_nested_bh() acquires the specified per-CPU
lock. It does not disable CPU migration because it relies on
local_bh_disable() disabling CPU migration.
With LOCKDEP it performans the usual lockdep checks as with !PREEMPT_RT.
Due to include hell the softirq check has been moved spinlock.c.
The intention is to use this locking in places where locking of a per-CPU
variable relies on BH being disabled. Instead of treating disabled
bottom halves as a big per-CPU lock, PREEMPT_RT can use this to reduce
the locking scope to what actually needs protecting.
A side effect is that it also documents the protection scope of the
per-CPU variables.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-3-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zac's syzbot crafted a bpf prog that exposed two bugs in may_goto.
The 1st bug is the way may_goto is patched. When offset is negative
it should be patched differently.
The 2nd bug is in the verifier:
when current state may_goto_depth is equal to visited state may_goto_depth
it means there is an actual infinite loop. It's not correct to prune
exploration of the program at this point.
Note, that this check doesn't limit the program to only one may_goto insn,
since 2nd and any further may_goto will increment may_goto_depth only
in the queued state pushed for future exploration. The current state
will have may_goto_depth == 0 regardless of number of may_goto insns
and the verifier has to explore the program until bpf_exit.
Fixes: 011832b97b ("bpf: Introduce may_goto instruction")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Closes: https://lore.kernel.org/bpf/CAADnVQL-15aNp04-cyHRn47Yv61NXfYyhopyZtUyxNojUZUXpA@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20240619235355.85031-1-alexei.starovoitov@gmail.com
After the rework of "Parallel CPU bringup", the cmdline "nosmp" and
"maxcpus=0" parameters are not working anymore. These parameters set
setup_max_cpus to zero and that's handed to bringup_nonboot_cpus().
The code there does a decrement before checking for zero, which brings it
into the negative space and brings up all CPUs.
Add a zero check at the beginning of the function to prevent this.
[ tglx: Massaged change log ]
Fixes: 18415f33e2 ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE")
Fixes: 06c6796e03 ("cpu/hotplug: Fix off by one in cpuhp_bringup_mask()")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240618081336.3996825-1-chenhuacai@loongson.cn
The scx_bpf_cpuperf_set() kfunc allows a BPF program to set the relative
performance target of a specified CPU. Commit d86adb4fc0 ("sched_ext: Add
cpuperf support") defined the @cpu argument to be unsigned. Let's update it
to be signed to match the norm for the rest of ext.c and the kernel.
Note that the kfunc declaration of scx_bpf_cpuperf_set() in the
common.bpf.h header in tools/sched_ext already listed the cpu as signed, so
this also fixes the build for tools/sched_ext and the sched_ext selftests
due to kfunc declarations now being emitted in vmlinux.h based on BTF (thus
causing the compiler to error due to observing conflicting types).
Fixes: d86adb4fc0 ("sched_ext: Add cpuperf support")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to
the highest performance level from running RT tasks.
Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:
- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
different CPUs in the system.
- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
relative to its max performance.
- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
This gives direct control over CPU performance setting to the BPF scheduler.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling frequency holding heuristics as it may
not apply well to sched_ext schedulers which may have a lot weaker
connection between tasks and their current / last CPU.
With cpuperf support added, there is no reason to block uclamp. Enable while
at it.
A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.
v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
to avoid factoring in stale util metric. (Christian)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Christian Loehle <christian.loehle@arm.com>
sugov_cpu_is_busy() is used to avoid decreasing performance level while the
CPU is busy and called by sugov_update_single_freq() and
sugov_update_single_perf(). Both callers repeat the same pattern to first
test for uclamp and then the business. Let's refactor so that the tests
aren't repeated.
The new helper is named sugov_hold_freq() and tests both the uclamp
exception and CPU business. No functional changes. This will make adding
more exception conditions easier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
A dying worker is first moved from pool->workers to pool->dying_workers
in set_worker_dying() and removed from pool->dying_workers in
detach_dying_workers(). The whole procedure is in the some lock context
of wq_pool_attach_mutex.
So pool->dying_workers is useless, just remove it and keep the dying
worker in pool->workers after set_worker_dying() and remove it in
detach_dying_workers() with wq_pool_attach_mutex held.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The code to kick off the destruction of workers is now in a process
context (idle_cull_fn()), and the detaching of a worker is not required
to be inside the worker thread now, so just do the detaching directly
in idle_cull_fn().
wake_dying_workers() is renamed to detach_dying_workers() and the unneeded
wakeup in wake_dying_workers() is also removed.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
So that when the rescuer is woken up next time, it will not interrupt
the last working cpu which might be busy on other crucial works but
have nothing to do with the rescuer's incoming works.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The code to kick off the destruction of workers is now in a process
context (idle_cull_fn()), so kthread_stop() can be used in the process
context to replace the work of pool->detach_completion.
The wakeup in wake_dying_workers() is unneeded after this change, but it
is harmless, jut keep it here until next patch renames wake_dying_workers()
rather than renaming it again and again.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Share relocation implementation with the kernel. As part of this,
we also need the type/string iteration functions so also share
btf_iter.c file. Relocation code in kernel and userspace is identical
save for the impementation of the reparenting of split BTF to the
relocated base BTF and retrieval of the BTF header from "struct btf";
these small functions need separate user-space and kernel implementations
for the separate "struct btf"s they operate upon.
One other wrinkle on the kernel side is we have to map .BTF.ids in
modules as they were generated with the type ids used at BTF encoding
time. btf_relocate() optionally returns an array mapping from old BTF
ids to relocated ids, so we use that to fix up these references where
needed for kfuncs.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240620091733.1967885-5-alan.maguire@oracle.com
...as this will allow split BTF modules with a base BTF
representation (rather than the full vmlinux BTF at time of
BTF encoding) to resolve their references to kernel types in a
way that is more resilient to small changes in kernel types.
This will allow modules that are not built every time the kernel
is to provide more resilient BTF, rather than have it invalidated
every time BTF ids for core kernel types change.
Fields are ordered to avoid holes in struct module.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240620091733.1967885-3-alan.maguire@oracle.com
The BPF ring buffer internally is implemented as a power-of-2 sized circular
buffer, with two logical and ever-increasing counters: consumer_pos is the
consumer counter to show which logical position the consumer consumed the
data, and producer_pos which is the producer counter denoting the amount of
data reserved by all producers.
Each time a record is reserved, the producer that "owns" the record will
successfully advance producer counter. In user space each time a record is
read, the consumer of the data advanced the consumer counter once it finished
processing. Both counters are stored in separate pages so that from user
space, the producer counter is read-only and the consumer counter is read-write.
One aspect that simplifies and thus speeds up the implementation of both
producers and consumers is how the data area is mapped twice contiguously
back-to-back in the virtual memory, allowing to not take any special measures
for samples that have to wrap around at the end of the circular buffer data
area, because the next page after the last data page would be first data page
again, and thus the sample will still appear completely contiguous in virtual
memory.
Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
book-keeping the length and offset, and is inaccessible to the BPF program.
Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
for the BPF program to use. Bing-Jhong and Muhammad reported that it is however
possible to make a second allocated memory chunk overlapping with the first
chunk and as a result, the BPF program is now able to edit first chunk's
header.
For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
[0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets
allocate a chunk B with size 0x3000. This will succeed because consumer_pos
was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
page and could cause a crash.
Fix it by calculating the oldest pending_pos and check whether the range
from the oldest outstanding record to the newest would span beyond the ring
buffer size. If that is the case, then reject the request. We've tested with
the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh)
before/after the fix and while it seems a bit slower on some benchmarks, it
is still not significantly enough to matter.
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when
a CPU is taken away by a task dispatched from a higher priority sched_class
so that the BPF scheduler can, e.g., punt the task[s] which was running or
were waiting for the CPU to other CPUs.
Replace the sched_ext specific hook scx_next_task_picked() with a new
sched_class operation switch_class().
The changes are straightforward and the code looks better afterwards.
However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook
which is unlikely to be useful to other sched_classes. For further
discussion on this subject, please refer to the following:
http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw@mail.gmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
When the following program is processed by the verifier:
L1: may_goto L2
goto L1
L2: w0 = 0
exit
the may_goto insn is first converted to:
L1: r11 = *(u64 *)(r10 -8)
if r11 == 0x0 goto L2
r11 -= 1
*(u64 *)(r10 -8) = r11
goto L1
L2: w0 = 0
exit
then later as the last step the verifier inserts:
*(u64 *)(r10 -8) = BPF_MAX_LOOPS
as the first insn of the program to initialize loop count.
When the first insn happens to be a branch target of some jmp the
bpf_patch_insn_data() logic will produce:
L1: *(u64 *)(r10 -8) = BPF_MAX_LOOPS
r11 = *(u64 *)(r10 -8)
if r11 == 0x0 goto L2
r11 -= 1
*(u64 *)(r10 -8) = r11
goto L1
L2: w0 = 0
exit
because instruction patching adjusts all jmps and calls, but for this
particular corner case it's incorrect and the L1 label should be one
instruction down, like:
*(u64 *)(r10 -8) = BPF_MAX_LOOPS
L1: r11 = *(u64 *)(r10 -8)
if r11 == 0x0 goto L2
r11 -= 1
*(u64 *)(r10 -8) = r11
goto L1
L2: w0 = 0
exit
and that's what this patch is fixing.
After bpf_patch_insn_data() call adjust_jmp_off() to adjust all jmps
that point to newly insert BPF_ST insn to point to insn after.
Note that bpf_patch_insn_data() cannot easily be changed to accommodate
this logic, since jumps that point before or after a sequence of patched
instructions have to be adjusted with the full length of the patch.
Conceptually it's somewhat similar to "insert" of instructions between other
instructions with weird semantics. Like "insert" before 1st insn would require
adjustment of CALL insns to point to newly inserted 1st insn, but not an
adjustment JMP insns that point to 1st, yet still adjusting JMP insns that
cross over 1st insn (point to insn before or insn after), hence use simple
adjust_jmp_off() logic to fix this corner case. Ideally bpf_patch_insn_data()
would have an auxiliary info to say where 'the start of newly inserted patch
is', but it would be too complex for backport.
Fixes: 011832b97b ("bpf: Introduce may_goto instruction")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Closes: https://lore.kernel.org/bpf/CAADnVQJ_WWx8w4b=6Gc2EpzAjgv+6A0ridnMz2TvS2egj4r3Gw@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20240619011859.79334-1-alexei.starovoitov@gmail.com
The new generic LSM hook security_file_post_open() was recently added
to the LSM framework in commit 8f46ff5767 ("security: Introduce
file_post_open hook"). Let's proactively add this generic LSM hook to
the sleepable_lsm_hooks BTF ID set, because I can't see there being
any strong reasons not to, and it's only a matter of time before
someone else comes around and asks for it to be there.
security_file_post_open() is inherently sleepable as it's purposely
situated in the kernel that allows LSMs to directly read out the
contents of the backing file if need be. Additionally, it's called
directly after security_file_open(), and that LSM hook in itself
already exists in the sleepable_lsm_hooks BTF ID set.
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240618192923.379852-1-mattbobrowski@google.com
The function kdb_position_cursor() takes in a "prompt" parameter but
never uses it. This doesn't _really_ matter since all current callers
of the function pass the same value and it's a global variable, but
it's a bit ugly. Let's clean it up.
Found by code inspection. This patch is expected to functionally be a
no-op.
Fixes: 09b3598942 ("kdb: Use format-strings rather than '\0' injection in kdb_read()")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528071144.1.I0feb49839c6b6f4f2c4bf34764f5e95de3f55a66@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
When -Wformat-security is not disabled, using a string pointer
as a format causes a warning:
kernel/debug/kdb/kdb_io.c: In function 'kdb_read':
kernel/debug/kdb/kdb_io.c:365:36: error: format not a string literal and no format arguments [-Werror=format-security]
365 | kdb_printf(kdb_prompt_str);
| ^~~~~~~~~~~~~~
kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
kernel/debug/kdb/kdb_io.c:456:20: error: format not a string literal and no format arguments [-Werror=format-security]
456 | kdb_printf(kdb_prompt_str);
| ^~~~~~~~~~~~~~
Use an explcit "%s" format instead.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528121154.3662553-1-arnd@kernel.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>