Martin KaFai Lau says:
====================
The bpf_prog can store specific info to a sk by using bpf_sk_storage.
In other words, a sk can be extended by a bpf_prog.
This series is to support providing bpf_sk_storage data during inet_diag's
dump. The primary target is the usage like iproute2's "ss".
The first two patches are refactoring works in inet_diag to make
adding bpf_sk_storage support easier. The next two patches do
the actual work.
Please see individual patch for details.
v2:
- Add commit message for u16 to u32 change in min_dump_alloc in Patch 4 (Song)
- Add comment to explain the !skb->len check in __inet_diag_dump in Patch 4.
- Do the map->map_type check earlier in Patch 3 for readability.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch will dump out the bpf_sk_storages of a sk
if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr.
An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in
INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump.
If no map_fd is specified, all bpf_sk_storages of a sk will be dumped.
bpf_sk_storages can be added to the system at runtime. It is difficult
to find a proper static value for cb->min_dump_alloc.
This patch learns the nlattr size required to dump the bpf_sk_storages
of a sk. If it happens to be the very first nlmsg of a dump and it
cannot fit the needed bpf_sk_storages, it will try to expand the
skb by "pskb_expand_head()".
Instead of expanding it in inet_sk_diag_fill(), it is expanded at a
sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can
be used. In __inet_diag_dump(), it will retry as long as the
skb is empty and the cb->min_dump_alloc becomes larger than before.
cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE. The min_dump_alloc
is also changed from 'u16' to 'u32' to accommodate a sk that may have
a few large bpf_sk_storages.
The updated cb->min_dump_alloc will also be used to allocate the skb in
the next dump. This logic already exists in netlink_dump().
Here is the sample output of a locally modified 'ss' and it could be made
more readable by using BTF later:
[root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989'
State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
ESTAB 0 0 [::1]:51072 [::1]:8989
bpf_map_id:14 value:[ 3feb ]
bpf_map_id:13 value:[ 3f ]
ESTAB 0 0 [::1]:51070 [::1]:8989
bpf_map_id:14 value:[ 3feb ]
bpf_map_id:13 value:[ 3f ]
[root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989'
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 [::1]:51072 [::1]:8989
bpf_map_id:14 value:[ 3feb ]
bpf_map_id:13 value:[ 3f ]
bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
ESTAB 0 0 [::1]:51070 [::1]:8989
bpf_map_id:14 value:[ 3feb ]
bpf_map_id:13 value:[ 3f ]
bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
This patch adds INET_DIAG support to bpf_sk_storage.
1. Although this series adds bpf_sk_storage diag capability to inet sk,
bpf_sk_storage is in general applicable to all fullsock. Hence, the
bpf_sk_storage logic will operate on SK_DIAG_* nlattr. The caller
will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
the argument.
2. The request will be like:
INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch)
SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
......
Considering there could have multiple bpf_sk_storages in a sk,
instead of reusing INET_DIAG_INFO ("ss -i"), the user can select
some specific bpf_sk_storage to dump by specifying an array of
SK_DIAG_BPF_STORAGE_REQ_MAP_FD.
If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
of a sk.
3. The reply will be like:
INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch)
SK_DIAG_BPF_STORAGE (nla_nest)
SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
SK_DIAG_BPF_STORAGE (nla_nest)
SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
......
4. Unlike other INET_DIAG info of a sk which is pretty static, the size
required to dump the bpf_sk_storage(s) of a sk is dynamic as the
system adding more bpf_sk_storage_map. It is hard to set a static
min_dump_alloc size.
Hence, this series learns it at the runtime and adjust the
cb->min_dump_alloc as it iterates all sk(s) of a system. The
"unsigned int *res_diag_size" in bpf_sk_storage_diag_put()
is for this purpose.
The next patch will update the cb->min_dump_alloc as it
iterates the sk(s).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
The INET_DIAG_REQ_BYTECODE nlattr is currently re-found every time when
the "dump()" is re-started.
In a latter patch, it will also need to parse the new
INET_DIAG_REQ_SK_BPF_STORAGES nlattr to learn the map_fds. Thus, this
patch takes this chance to store the parsed nlattr in cb->data
during the "start" time of a dump.
By doing this, the "bc" argument also becomes unnecessary
and is removed. Also, the two copies of the INET_DIAG_REQ_BYTECODE
parsing-audit logic between compat/current version can be
consolidated to one.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230415.1975555-1-kafai@fb.com
In a latter patch, there is a need to update "cb->min_dump_alloc"
in inet_sk_diag_fill() as it learns the diffierent bpf_sk_storages
stored in a sk while dumping all sk(s) (e.g. tcp_hashinfo).
The inet_sk_diag_fill() currently does not take the "cb" as an argument.
One of the reason is inet_sk_diag_fill() is used by both dump_one()
and dump() (which belong to the "struct inet_diag_handler". The dump_one()
interface does not pass the "cb" along.
This patch is to make dump_one() pass a "cb". The "cb" is created in
inet_diag_cmd_exact(). The "nlh" and "in_skb" are stored in "cb" as
the dump() interface does. The total number of args in
inet_sk_diag_fill() is also cut from 10 to 7 and
that helps many callers to pass fewer args.
In particular,
"struct user_namespace *user_ns", "u32 pid", and "u32 seq"
can be replaced by accessing "cb->nlh" and "cb->skb".
A similar argument reduction is also made to
inet_twsk_diag_fill() and inet_req_diag_fill().
inet_csk_diag_dump() and inet_csk_diag_fill() are also removed.
They are mostly equivalent to inet_sk_diag_fill(). Their repeated
usages are very limited. Thus, inet_sk_diag_fill() is directly used
in those occasions.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230409.1975173-1-kafai@fb.com
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor
Michal Rostecki says:
====================
Feature probes in bpftool related to bpf_probe_write_user and
bpf_trace_printk helpers emit dmesg warnings which might be confusing
for people running bpftool on production environments. This patch series
addresses that by filtering them out by default and introducing the new
positional argument "full" which enables all available probes.
The main motivation behind those changes is ability the fact that some
probes (for example those related to "trace" or "write_user" helpers)
emit dmesg messages which might be confusing for people who are running
on production environments. For details see the Cilium issue[0].
v1 -> v2:
- Do not expose regex filters to users, keep filtering logic internal,
expose only the "full" option for including probes which emit dmesg
warnings.
v2 -> v3:
- Do not use regex for filtering out probes, use function IDs directly.
- Fix bash completion - in v2 only "prefix" was proposed after "macros",
"dev" and "kernel" were not.
- Rephrase the man page paragraph, highlight helper function names.
- Remove tests which parse the plain output of bpftool (except the
header/macros test), focus on testing JSON output instead.
- Add test which compares the output with and without "full" option.
v3 -> v4:
- Use enum to check for helper functions.
- Make selftests compatible with older versions of Python 3.x than 3.7.
[0] https://github.com/cilium/cilium/issues/10048
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Due to various bugs in tests clean up code (usually), if host system is
misconfigured, it happens that test_progs will just crash in the middle of
running a test with little to no indication of where and why the crash
happened. For cases where coredump is not readily available (e.g., inside
a CI), it's very helpful to have a stack trace, which lead to crash, to be
printed out. This change adds a signal handler that will capture and print out
symbolized backtrace:
$ sudo ./test_progs -t mmap
test_mmap:PASS:skel_open_and_load 0 nsec
test_mmap:PASS:bss_mmap 0 nsec
test_mmap:PASS:data_mmap 0 nsec
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x18)[0x42a888]
/lib64/libpthread.so.0(+0xf5d0)[0x7f2aab5175d0]
./test_progs(test_mmap+0x3c0)[0x41f0a0]
./test_progs(main+0x160)[0x407d10]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f2aab15d3d5]
./test_progs[0x407ebc]
[1] 1988412 segmentation fault (core dumped) sudo ./test_progs -t mmap
Unfortunately, glibc's symbolization support is unable to symbolize static
functions, only global ones will be present in stack trace. But it's still a
step forward without adding extra libraries to get a better symbolization.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225000847.3965188-1-andriin@fb.com
Currently we run SYN cookies test for all socket types and mark the test as
skipped if socket type is not compatible. This causes confusion because
skipped test might indicate a problem with the testing environment.
Instead, run the test only for the socket type which supports SYN cookies.
Also, switch to using designated initializers when setting up tests, so
that we can tweak only some test parameters, leaving the rest initialized
to default values.
Fixes: eecd618b45 ("selftests/bpf: Mark SYN cookie test skipped for UDP sockets")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200224135327.121542-2-jakub@cloudflare.com
SOCKMAP and SOCKHASH map types can be used with reuseport BPF programs but
don't support yet storing UDP sockets. Instead of marking UDP tests with
SOCK{MAP,HASH} as skipped, don't run them at all.
Skipped test might signal that the test environment is not suitable for
running the test, while in reality the functionality is not implemented in
the kernel yet.
Before:
sh# ./test_progs -t select_reuseport
…
#40 select_reuseport:OK
Summary: 1/126 PASSED, 30 SKIPPED, 0 FAILED
After:
sh# ./test_progs -t select_reuseport
…
#40 select_reuseport:OK
Summary: 1/98 PASSED, 2 SKIPPED, 0 FAILED
The remaining two skipped tests are SYN cookies tests, which will be
addressed in the subsequent patch.
Fixes: 11318ba8ca ("selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200224135327.121542-1-jakub@cloudflare.com
Thomas Gleixner says:
====================
This is the third version of the BPF/RT patch set which makes both coexist
nicely. The long explanation can be found in the cover letter of the V1
submission:
https://lore.kernel.org/r/20200214133917.304937432@linutronix.de
V2 is here:
https://lore.kernel.org/r/20200220204517.863202864@linutronix.de
The following changes vs. V2 have been made:
- Rebased to bpf-next, adjusted to the lock changes in the hashmap code.
- Split the preallocation enforcement patch for instrumentation type BPF
programs into two pieces:
1) Emit a one-time warning on !RT kernels when any instrumentation type
BPF program uses run-time allocation. Emit also a corresponding
warning in the verifier log. But allow the program to run for
backward compatibility sake. After a grace period this should be
enforced.
2) On RT reject such programs because on RT the memory allocator cannot
be called from truly atomic contexts.
- Fixed the fallout from V2 as reported by Alexei and 0-day
- Removed the redundant preempt_disable() from trace_call_bpf()
- Removed the unused export of trace_call_bpf()
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In a RT kernel down_read_trylock() cannot be used from NMI context and
up_read_non_owner() is another problematic issue.
So in such a configuration, simply elide the annotated stackmap and
just report the raw IPs.
In the longer term, it might be possible to provide a atomic friendly
versions of the page cache traversal which will at least provide the info
if the pages are resident and don't need to be paged in.
[ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work
callback and add a comment ]
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145644.708960317@linutronix.de
The LPM trie map cannot be used in contexts like perf, kprobes and tracing
as this map type dynamically allocates memory.
The memory allocation happens with a raw spinlock held which is a truly
spinning lock on a PREEMPT RT enabled kernel which disables preemption and
interrupts.
As RT does not allow memory allocation from such a section for various
reasons, convert the raw spinlock to a regular spinlock.
On a RT enabled kernel these locks are substituted by 'sleeping' spinlocks
which provide the proper protection but keep the code preemptible.
On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does
not cause any functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145644.602129531@linutronix.de
PREEMPT_RT forbids certain operations like memory allocations (even with
GFP_ATOMIC) from atomic contexts. This is required because even with
GFP_ATOMIC the memory allocator calls into code pathes which acquire locks
with long held lock sections. To ensure the deterministic behaviour these
locks are regular spinlocks, which are converted to 'sleepable' spinlocks
on RT. The only true atomic contexts on an RT kernel are the low level
hardware handling, scheduling, low level interrupt handling, NMIs etc. None
of these contexts should ever do memory allocations.
As regular device interrupt handlers and soft interrupts are forced into
thread context, the existing code which does
spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*();
just works.
In theory the BPF locks could be converted to regular spinlocks as well,
but the bucket locks and percpu_freelist locks can be taken from arbitrary
contexts (perf, kprobes, tracepoints) which are required to be atomic
contexts even on RT. These mechanisms require preallocated maps, so there
is no need to invoke memory allocations within the lock held sections.
BPF maps which need dynamic allocation are only used from (forced) thread
context on RT and can therefore use regular spinlocks which in turn allows
to invoke memory allocations from the lock held section.
To achieve this make the hash bucket lock a union of a raw and a regular
spinlock and initialize and lock/unlock either the raw spinlock for
preallocated maps or the regular variant for maps which require memory
allocations.
On a non RT kernel this distinction is neither possible nor required.
spinlock maps to raw_spinlock and the extra code and conditional is
optimized out by the compiler. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145644.509685912@linutronix.de
As a preparation for making the BPF locking RT friendly, factor out the
hash bucket lock operations into inline functions. This allows to do the
necessary RT modification in one place instead of sprinkling it all over
the place. No functional change.
The now unused htab argument of the lock/unlock functions will be used in
the next step which adds PREEMPT_RT support.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145644.420416916@linutronix.de
The required protection is that the caller cannot be migrated to a
different CPU as these functions end up in places which take either a hash
bucket lock or might trigger a kprobe inside the memory allocator. Both
scenarios can lead to deadlocks. The deadlock prevention is per CPU by
incrementing a per CPU variable which temporarily blocks the invocation of
BPF programs from perf and kprobes.
Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs
with the new helper functions. These functions are already prepared to make
BPF work on PREEMPT_RT enabled kernels. No functional change for !RT
kernels.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145644.317843926@linutronix.de
The required protection is that the caller cannot be migrated to a
different CPU as these places take either a hash bucket lock or might
trigger a kprobe inside the memory allocator. Both scenarios can lead to
deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
variable which temporarily blocks the invocation of BPF programs from perf
and kprobes.
Replace the open coded preempt_disable/enable() and this_cpu_inc/dec()
pairs with the new recursion prevention helpers to prepare BPF to work on
PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable
in the helpers map to preempt_disable/enable(), i.e. no functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145644.211208533@linutronix.de
The places which need to prevent the execution of trace type BPF programs
to prevent deadlocks on the hash bucket lock do this open coded.
Provide two inline functions, bpf_disable/enable_instrumentation() to
replace these open coded protection constructs.
Use migrate_disable/enable() instead of preempt_disable/enable() right away
so this works on RT enabled kernels. On a !RT kernel migrate_disable /
enable() are mapped to preempt_disable/enable().
These helpers use this_cpu_inc/dec() instead of __this_cpu_inc/dec() on an
RT enabled kernel because migrate disabled regions are preemptible and
preemption might hit in the middle of a RMW operation which can lead to
inconsistent state.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145644.103910133@linutronix.de
Replace the preemption disable/enable with migrate_disable/enable() to
reflect the actual requirement and to allow PREEMPT_RT to substitute it
with an actual migration disable mechanism which does not disable
preemption.
Including the code paths that go via __bpf_prog_run_save_cb().
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.998293311@linutronix.de
Replace the preemption disable/enable with migrate_disable/enable() to
reflect the actual requirement and to allow PREEMPT_RT to substitute it
with an actual migration disable mechanism which does not disable
preemption.
[ tglx: Switched it over to migrate disable ]
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.785306549@linutronix.de
All of these cases are strictly of the form:
preempt_disable();
BPF_PROG_RUN(...);
preempt_enable();
Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
with:
migrate_disable();
BPF_PROG_RUN(...);
migrate_enable();
On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.
Therefore, this is a trivially correct transformation.
The seccomp loop does not need protection over the loop. It only needs
protection per BPF filter program
[ tglx: Converted to bpf_prog_run_pin_on_cpu() ]
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de
As already discussed in the previous change which introduced
BPF_RUN_PROG_PIN_ON_CPU() BPF only requires to disable migration to
guarantee per CPUness.
If RT substitutes the preempt disable based migration protection then the
cant_sleep() check will obviously trigger as preemption is not disabled.
Replace it by cant_migrate() which maps to cant_sleep() on a non RT kernel
and will verify that migration is disabled on a full RT kernel.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.583038889@linutronix.de
BPF programs require to run on one CPU to completion as they use per CPU
storage, but according to Alexei they don't need reentrancy protection as
obviously BPF programs running in thread context can always be 'preempted'
by hard and soft interrupts and instrumentation and the same program can
run concurrently on a different CPU.
The currently used mechanism to ensure CPUness is to wrap the invocation
into a preempt_disable/enable() pair. Disabling preemption is also
disabling migration for a task.
preempt_disable/enable() is used because there is no explicit way to
reliably disable only migration.
Provide a separate macro to invoke a BPF program which can be used in
migrateable task context.
It wraps BPF_PROG_RUN() in a migrate_disable/enable() pair which maps on
non RT enabled kernels to preempt_disable/enable(). On RT enabled kernels
this merely disables migration. Both methods ensure that the invoked BPF
program runs on one CPU to completion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.474592620@linutronix.de
pcpu_freelist_populate() is disabling interrupts and then iterates over the
possible CPUs. The reason why this disables interrupts is to silence
lockdep because the invoked ___pcpu_freelist_push() takes spin locks.
Neither the interrupt disabling nor the locking are required in this
function because it's called during initialization and the resulting map is
not yet visible to anything.
Split out the actual push assignement into an inline, call it from the loop
and remove the interrupt disable.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.365930116@linutronix.de
If an element is freed via RCU then recursion into BPF instrumentation
functions is not a concern. The element is already detached from the map
and the RCU callback does not hold any locks on which a kprobe, perf event
or tracepoint attached BPF program could deadlock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de
Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is
invoked from a trace point via __DO_TRACE() which already disables
preemption _before_ invoking any of the functions which are attached to a
trace point.
Remove it and add a cant_sleep() check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de
trace_call_bpf() no longer disables preemption on its own.
All callers of this function has to do it explicitly.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
All callers are built in. No point to export this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.
This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.
Remove it and add a cant_sleep() check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de
The comment where the bucket lock is acquired says:
/* bpf_map_update_elem() can be called in_irq() */
which is not really helpful and aside of that it does not explain the
subtle details of the hash bucket locks expecially in the context of BPF
and perf, kprobes and tracing.
Add a comment at the top of the file which explains the protection scopes
and the details how potential deadlocks are prevented.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.755793061@linutronix.de
Aside of the general unsafety of run-time map allocation for
instrumentation type programs RT enabled kernels have another constraint:
The instrumentation programs are invoked with preemption disabled, but the
memory allocator spinlocks cannot be acquired in atomic context because
they are converted to 'sleeping' spinlocks on RT.
Therefore enforce map preallocation for these programs types when RT is
enabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.648784007@linutronix.de
The assumption that only programs attached to perf NMI events can deadlock
on memory allocators is wrong. Assume the following simplified callchain:
kmalloc() from regular non BPF context
cache empty
freelist empty
lock(zone->lock);
tracepoint or kprobe
BPF()
update_elem()
lock(bucket)
kmalloc()
cache empty
freelist empty
lock(zone->lock); <- DEADLOCK
There are other ways which do not involve locking to create wreckage:
kmalloc() from regular non BPF context
local_irq_save();
...
obj = slab_first();
kprobe()
BPF()
update_elem()
lock(bucket)
kmalloc()
local_irq_save();
...
obj = slab_first(); <- Same object as above ...
So preallocation _must_ be enforced for all variants of intrusive
instrumentation.
Unfortunately immediate enforcement would break backwards compatibility, so
for now such programs still are allowed to run, but a one time warning is
emitted in dmesg and the verifier emits a warning in the verifier log as
well so developers are made aware about this and can fix their programs
before the enforcement becomes mandatory.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.540542802@linutronix.de
Saeed Mahameed says:
====================
mlxfw: Improve error reporting and FW reactivate support
This patchset improves mlxfw error reporting to netlink and to
kernel log.
V2:
- Use proper err codes, EBUSY/EIO instead of EALREADY/EREMOTEIO
- Fix typo.
From Eran and me.
1) patch #1, Make mlxfw/mlxsw fw flash devlink status notify generic,
and enable it for mlx5.
2) patches #2..#5 are improving mlxfw flash error messages by
reporting detailed mlxfw FSM error messages to netlink and kernel log.
3) patches #6,7 From Eran: Add FW reactivate flow to mlxfw and mlx5
====================
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for fsm reactivate via MIRC (Management Image Re-activation
Control) set and query commands.
For re-activation flow, driver shall first run MIRC set, and then wait
until FW is done (via querying MIRC status).
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose fsm_reactivate callback to the mlxfw_dev_ops struct. FSM reactivate
is needed before flashing the new image in order to flush the old flashed
but not running firmware image.
In case mlxfw_dev do not support the reactivation, this step will be
skipped. But if later image flash will fail, a hint will be provided by
the extack to advise the user that the failure might be related to it.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of always calling both mlxfw_err and NL_SET_ERR_MSG_MOD with the
same message, use the dedicated macro instead.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce mlxfw_{info, err, dbg} macros and make them call corresponding
dev_* macros, then convert all instances of pr_* to mlxfw_*.
This will allow printing the device name mlxfw is operating on.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure mlxfw_firmware_flash reports a detailed user readable error
message in every possible error path, basically every time
mlxfw_dev->ops->*() is called and an error is returned, or when image
initialization is failed.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>