Commit Graph

1247962 Commits

Dave Thaler
20e109ea98 bpf, docs: Clarify that MOVSX is only for BPF_X not BPF_K
Per discussion on the mailing list at
https://mailarchive.ietf.org/arch/msg/bpf/uQiqhURdtxV_ZQOTgjCdm-seh74/
the MOVSX operation is only defined to support register extension.

The document didn't previously state this and incorrectly implied
that one could use an immediate value.
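
For illustration, a hedged sketch in terms of the uapi struct bpf_insn
encoding (this example is not part of the patch):

    /* MOVSX is encoded as a BPF_MOV with a non-zero offset (8, 16, or
     * 32) selecting how many low bits of the source to sign-extend.
     */
    struct bpf_insn movsx_ok = {
        .code    = BPF_ALU64 | BPF_MOV | BPF_X, /* register source: defined */
        .dst_reg = BPF_REG_1,
        .src_reg = BPF_REG_2,
        .off     = 8, /* sign-extend the low 8 bits of r2 into r1 */
    };
    /* The same .code with BPF_K (immediate source) and a non-zero .off
     * is NOT defined: there is no immediate form of MOVSX.
     */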

Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240118232954.27206-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 15:10:08 -08:00
Kuniyuki Iwashima
b3f086a7a1 bpf: Define struct bpf_tcp_req_attrs when CONFIG_SYN_COOKIES=n.
kernel test robot reported the warning below:

  >> net/core/filter.c:11842:13: warning: declaration of 'struct bpf_tcp_req_attrs' will not be visible outside of this function [-Wvisibility]
      11842 |                                         struct bpf_tcp_req_attrs *attrs, int attrs__sz)
            |                                                ^
     1 warning generated.

struct bpf_tcp_req_attrs is defined under CONFIG_SYN_COOKIES but is
used in a kfunc even when the config is disabled.

Let's move the struct bpf_tcp_req_attrs definition outside of the
CONFIG_SYN_COOKIES guard.
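
A minimal sketch of the shape of the fix (fields elided; not the literal
diff):

    /* define the struct unconditionally so the kfunc prototype in
     * net/core/filter.c can always see it
     */
    struct bpf_tcp_req_attrs {
        /* ... */
    };

    #ifdef CONFIG_SYN_COOKIES
    /* the code that actually consumes the attrs stays under the config */
    #endif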

Fixes: e472f88891 ("bpf: tcp: Support arbitrary SYN Cookie.")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401180418.CUVc0hxF-lkp@intel.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240118211751.25790-1-kuniyu@amazon.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 15:08:03 -08:00
Hao Sun
2ce793ebe2 bpf: Refactor ptr alu checking rules to allow alu explicitly
Current checking rules are structured to disallow alu on particular ptr
types explicitly, so default cases are allowed implicitly. This may lead
to newly added ptr types being allowed unexpectedly. So restructure the
rules to allow alu explicitly. The tradeoff is mainly a few more cases
in the switch. The following table from Eduard summarizes the rules:

        | Pointer type        | Arithmetics allowed |
        |---------------------+---------------------|
        | PTR_TO_CTX          | yes                 |
        | CONST_PTR_TO_MAP    | conditionally       |
        | PTR_TO_MAP_VALUE    | yes                 |
        | PTR_TO_MAP_KEY      | yes                 |
        | PTR_TO_STACK        | yes                 |
        | PTR_TO_PACKET_META  | yes                 |
        | PTR_TO_PACKET       | yes                 |
        | PTR_TO_PACKET_END   | no                  |
        | PTR_TO_FLOW_KEYS    | conditionally       |
        | PTR_TO_SOCKET       | no                  |
        | PTR_TO_SOCK_COMMON  | no                  |
        | PTR_TO_TCP_SOCK     | no                  |
        | PTR_TO_TP_BUFFER    | yes                 |
        | PTR_TO_XDP_SOCK     | no                  |
        | PTR_TO_BTF_ID       | yes                 |
        | PTR_TO_MEM          | yes                 |
        | PTR_TO_BUF          | yes                 |
        | PTR_TO_FUNC         | yes                 |
        | CONST_PTR_TO_DYNPTR | yes                 |

The refactored rules are equivalent to the original ones. Note that
PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not rejected here because: (1)
check_mem_access() rejects load/store on those ptrs, and those ptrs
with an offset passed to calls are rejected by check_func_arg_reg_off();
(2) someone may rely on the verifier not rejecting programs early.
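
A hedged sketch of the refactored shape (case list abbreviated; details
condensed, not the literal patch):

    switch (ptr_reg->type) {
    case PTR_TO_CTX:
    case PTR_TO_MAP_VALUE:
    case PTR_TO_STACK:
        /* ... the other explicitly allowed types ... */
        break;
    case CONST_PTR_TO_MAP:
    case PTR_TO_FLOW_KEYS:
        /* conditionally allowed */
        break;
    default:
        /* anything not listed, including future ptr types, is
         * rejected by default
         */
        return -EACCES;
    }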

Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240117094012.36798-1-sunhao.th@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 15:01:39 -08:00
Andrey Grafin
40628f9fff selftest/bpf: Add map_in_maps with BPF_MAP_TYPE_PERF_EVENT_ARRAY values
Check that bpf_object__load() successfully creates map_in_maps
with BPF_MAP_TYPE_PERF_EVENT_ARRAY values.
These changes cover the fix in the previous patch
"libbpf: Apply map_set_def_max_entries() for inner_maps on creation".

The command line output is:
- w/o fix
$ sudo ./test_maps
libbpf: map 'mim_array_pe': failed to create inner map: -22
libbpf: map 'mim_array_pe': failed to create: Invalid argument(-22)
libbpf: failed to load object './test_map_in_map.bpf.o'
Failed to load test prog

- with fix
$ sudo ./test_maps
...
test_maps: OK, 0 SKIPPED

Fixes: 646f02ffdd ("libbpf: Add BTF-defined map-in-map support")
Signed-off-by: Andrey Grafin <conquistador@yandex-team.ru>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20240117130619.9403-2-conquistador@yandex-team.ru
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:43:12 -08:00
Andrey Grafin
f04deb90e5 libbpf: Apply map_set_def_max_entries() for inner_maps on creation
This patch allows bpf_object__load() to auto-create
BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS maps with
BPF_MAP_TYPE_PERF_EVENT_ARRAY values.

The previous behaviour created a zero-filled btf_map_def for inner maps
and tried to use it for map creation, but the Linux kernel forbids
creating a BPF_MAP_TYPE_PERF_EVENT_ARRAY map with max_entries=0.
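
A hedged sketch of such a BTF-defined map-in-map (libbpf declaration
style; the outer map name is taken from the accompanying selftest, the
rest is illustrative):

    struct pe_map {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(max_entries, 1);
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
        __uint(max_entries, 1);
        __type(key, int);
        __array(values, struct pe_map);
    } mim_array_pe SEC(".maps");

    /* before the fix, libbpf created the inner map with max_entries=0,
     * which the kernel rejects with -EINVAL
     */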

Fixes: 646f02ffdd ("libbpf: Add BTF-defined map-in-map support")
Signed-off-by: Andrey Grafin <conquistador@yandex-team.ru>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20240117130619.9403-1-conquistador@yandex-team.ru
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:43:12 -08:00
Daniel Borkmann
091f2bf60d bpf: Sync uapi bpf.h header for the tooling infra
Both commit 91051f0039 ("tcp: Dump bound-only sockets in inet_diag.")
and commit 985b8ea9ec7e ("bpf, docs: Fix bpf_redirect_peer header doc")
missed the tooling header sync. Fix it.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:43:12 -08:00
Victor Stewart
f98df79bf7 bpf, docs: Fix bpf_redirect_peer header doc
Amend the bpf_redirect_peer() header documentation to also mention
support for the netkit device type.

Signed-off-by: Victor Stewart <v@nametag.social>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240116202952.241009-1-v@nametag.social
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:43:12 -08:00
Martin KaFai Lau
4eaafe5a5b Merge branch 'bpf: tcp: Support arbitrary SYN Cookie at TC.'
Kuniyuki Iwashima says:

====================
Under SYN Flood, the TCP stack generates a SYN Cookie to remain stateless
for the connection request until a valid ACK responds to the SYN+ACK.

The cookie contains two kinds of host-specific bits, a timestamp and
secrets, so it can only be validated by its generator.  This means a SYN
Cookie consumes network resources between the client and the server;
intermediate nodes must remember which node to route the ACK for the
cookie to.

SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
backend server and completes another 3WHS.  However, since the server's
ISN differs from the cookie, the proxy must manage the ISN mappings and
fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
node goes down, all the connections through it are terminated.  Keeping
state at the proxy is painful from that perspective.

At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
Our SYN Proxy consists of the front proxy layer and the backend kernel
module.  (See slides of LPC2023 [0], p37 - p48)

The cookie that SYN Proxy generates differs from the kernel's cookie in
that it contains a secret (called rolling salt) that is (i) shared by all
the proxy nodes so that any node can validate the ACK and (ii) updated
periodically so that old cookies cannot be validated and we need not
encode a timestamp in the cookie.  Also, the ISN encodes WScale, SACK,
and ECN rather than carrying them in TS val.  This is so as not to
sacrifice any connection quality for customers who turn off the TCP
timestamps option due to a retro CVE.

After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forwards
the TCP-in-TCP packet to the backend server.  Our kernel module works at
Netfilter input/output hooks and first feeds SYN to the TCP stack to
initiate 3WHS.  When the module is triggered for SYN+ACK, it looks up the
corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
proxy's cookie.  Then, the module can complete 3WHS with the original ACK
as is.

This way, our SYN Proxy neither manages the ISN mappings nor waits for
SYN+ACK from the backend, and thus can remain stateless.  It's working
very well for high-bandwidth services carrying multiple Tbps, but we are
looking for a way to drop the dirty hack and further optimise the
sequences.

If we could validate an arbitrary SYN Cookie on the backend server with
BPF, the proxy would not need to restore SYN nor pass it along.  After
validating the ACK, the proxy node just needs to forward it, and then the
server can do the lightweight validation (e.g. check if the ACK came from
proxy nodes, etc) and create a connection from the ACK.

This series allows us to create a full sk from an arbitrary SYN Cookie,
which is done in 3 steps.

  1) At tc, BPF prog calls a new kfunc to create a reqsk and configure
     it based on the argument populated from SYN Cookie.  The reqsk has
     its listener as req->rsk_listener and is passed to the TCP stack as
     skb->sk.

  2) During TCP socket lookup for the skb, skb_steal_sock() returns a
     listener in the reuseport group that inet_reqsk(skb->sk)->rsk_listener
     belongs to.

  3) In cookie_v[46]_check(), the reqsk (skb->sk) is fully initialised and
     a full sk is created.

The kfunc usage is as follows:

    struct bpf_tcp_req_attrs attrs = {
        .mss = mss,
        .wscale_ok = wscale_ok,
        .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
        .snd_wscale = snd_wscale, /* Client's WScale < 15 */
        .tstamp_ok = tstamp_ok,
        .rcv_tsval = tsval,
        .rcv_tsecr = tsecr, /* Server's Initial TSval */
        .usec_ts_ok = usec_ts_ok,
        .sack_ok = sack_ok,
        .ecn_ok = ecn_ok,
    };

    skc = bpf_skc_lookup_tcp(...);
    sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
    bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
    bpf_sk_release(skc);

[0]: https://lpc.events/event/17/contributions/1645/attachments/1350/2701/SYN_Proxy_at_Scale_with_BPF.pdf

Changes:
  v8
    * Rebase on Yonghong's cpuv4 fix
    * Patch 5
      * Fill the trailing 3-bytes padding in struct bpf_tcp_req_attrs
        and test it as null
    * Patch 6
      * Remove unused IPPROTP_MPTCP definition

  v7: https://lore.kernel.org/bpf/20231221012806.37137-1-kuniyu@amazon.com/
    * Patch 5 & 6
      * Drop MPTCP support

  v6: https://lore.kernel.org/bpf/20231214155424.67136-1-kuniyu@amazon.com/
    * Patch 5 & 6
      * /struct /s/tcp_cookie_attributes/bpf_tcp_req_attrs/
      * Don't reuse struct tcp_options_received and use u8 for each attrs
    * Patch 6
      * Check retval of test__start_subtest()

  v5: https://lore.kernel.org/netdev/20231211073650.90819-1-kuniyu@amazon.com/
    * Split patch 1-3
    * Patch 3
      * Clear req->rsk_listener in skb_steal_sock()
    * Patch 4 & 5
      * Move sysctl validation and tsoff init from cookie_bpf_check() to kfunc
    * Patch 5
      * Do not increment LINUX_MIB_SYNCOOKIES(RECV|FAILED)
    * Patch 6
      * Remove __always_inline
      * Test if tcp_handle_{syn,ack}() is executed
      * Move some definition to bpf_tracing_net.h
      * s/BPF_F_CURRENT_NETNS/-1/

  v4: https://lore.kernel.org/bpf/20231205013420.88067-1-kuniyu@amazon.com/
    * Patch 1 & 2
      * s/CONFIG_SYN_COOKIE/CONFIG_SYN_COOKIES/
    * Patch 1
      * Don't set rcv_wscale for BPF SYN Cookie case.
    * Patch 2
      * Add test for tcp_opt.{unused,rcv_wscale} in kfunc
      * Modify skb_steal_sock() to avoid resetting skb-sk
      * Support SO_REUSEPORT lookup
    * Patch 3
      * Add CONFIG_SYN_COOKIES to Kconfig for CI
      * Define BPF_F_CURRENT_NETNS

  v3: https://lore.kernel.org/netdev/20231121184245.69569-1-kuniyu@amazon.com/
    * Guard kfunc and req->syncookie part in inet6?_steal_sock() with
      CONFIG_SYN_COOKIE

  v2: https://lore.kernel.org/netdev/20231120222341.54776-1-kuniyu@amazon.com/
    * Drop SOCK_OPS and move SYN Cookie validation logic to TC with kfunc.
    * Add cleanup patches to reduce discrepancy between cookie_v[46]_check()

  v1: https://lore.kernel.org/bpf/20231013220433.70792-1-kuniyu@amazon.com/
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:42:51 -08:00
Kuniyuki Iwashima
a74712241b selftest: bpf: Test bpf_sk_assign_tcp_reqsk().
This commit adds a sample selftest to demonstrate how we can use
bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.

The test creates IPv4/IPv6 x TCP connections and transfers messages
over them on lo with a BPF tc prog attached.

The tc prog processes SYN and returns SYN+ACK with the following
ISN and TS.  In a real use case, this part will be done by other
hosts.

        MSB                                   LSB
  ISN:  | 31 ... 8 | 7 6 |   5 |    4 | 3 2 1 0 |
        |   Hash_1 | MSS | ECN | SACK |  WScale |

  TS:   | 31 ... 8 |          7 ... 0           |
        |   Random |           Hash_2           |

  WScale in SYN is reused in SYN+ACK.

The client returns ACK, and the tc prog recalculates ISN and TS
from the ACK and validates the SYN Cookie.

If it's valid, the prog calls the kfunc to allocate a reqsk for the skb
and configures the reqsk based on the argument created from the SYN
Cookie.

Later, the reqsk will be processed in cookie_v[46]_check() to create
a connection.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240115205514.68364-7-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:24 -08:00
Kuniyuki Iwashima
e472f88891 bpf: tcp: Support arbitrary SYN Cookie.
This patch adds a new kfunc available at TC hook to support arbitrary
SYN Cookie.

The basic usage is as follows:

    struct bpf_tcp_req_attrs attrs = {
        .mss = mss,
        .wscale_ok = wscale_ok,
        .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
        .snd_wscale = snd_wscale, /* Client's WScale < 15 */
        .tstamp_ok = tstamp_ok,
        .rcv_tsval = tsval,
        .rcv_tsecr = tsecr, /* Server's Initial TSval */
        .usec_ts_ok = usec_ts_ok,
        .sack_ok = sack_ok,
        .ecn_ok = ecn_ok,
    };

    skc = bpf_skc_lookup_tcp(...);
    sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
    bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs));
    bpf_sk_release(skc);

bpf_sk_assign_tcp_reqsk() takes the skb, a listener sk, and struct
bpf_tcp_req_attrs, then allocates a reqsk and configures it.  Then,
bpf_sk_assign_tcp_reqsk() links the reqsk with the skb and the listener.

The notable thing here is that we do not hold refcnt for both reqsk
and listener.  To differentiate that, we mark reqsk->syncookie, which
is only used in TX for now.  So, if reqsk->syncookie is 1 in RX, it
means that the reqsk is allocated by kfunc.

When the skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
and in that case, we set reqsk->rsk_listener to NULL before calling
reqsk_free(), as the reqsk does not hold a refcnt on the listener.
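
A hedged sketch of that sock_pfree() handling (condensed from the
description above; not the literal patch):

    void sock_pfree(struct sk_buff *skb)
    {
        if (!sk_is_refcounted(skb->sk))
            return;

        if (sk_is_tcp(skb->sk) &&
            skb->sk->sk_state == TCP_NEW_SYN_RECV &&
            inet_reqsk(skb->sk)->syncookie) {
            /* reqsk allocated by the kfunc: no refcnt is held on the
             * listener, so drop the pointer before freeing the reqsk
             */
            inet_reqsk(skb->sk)->rsk_listener = NULL;
            reqsk_free(inet_reqsk(skb->sk));
            return;
        }

        sock_gen_put(skb->sk);
    }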

When the TCP stack looks up a socket from the skb, we steal the
listener from the reqsk in skb_steal_sock() and create a full sk
in cookie_v[46]_check().

The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
after creating a full sk.

Note that we can extend struct bpf_tcp_req_attrs in the future when
we add a new attribute that is determined in 3WHS.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240115205514.68364-6-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:24 -08:00
Kuniyuki Iwashima
695751e31a bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().
We will support arbitrary SYN Cookie with BPF in the following
patch.

If the BPF prog validates the ACK and the kfunc allocates a reqsk, it
will be carried to cookie_v[46]_check() as skb->sk.  If skb->sk is not
NULL, we call cookie_bpf_check().

Then, we clear skb->sk and skb->destructor, which must be cleared so
that we do not hold a refcnt on the reqsk or the listener.  See the
following patch for details.

After that, we finish initialisation for the remaining fields with
cookie_tcp_reqsk_init().

Note that the server side WScale is set only for non-BPF SYN Cookie.
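
A hedged sketch of the dispatch (helper signatures assumed):

    /* in cookie_v[46]_check(); a sketch, not the literal diff */
    if (skb->sk)                              /* reqsk attached by the kfunc */
        req = cookie_bpf_check(sk, skb);      /* also clears skb->sk and
                                               * skb->destructor */
    else
        req = cookie_tcp_check(net, sk, skb); /* kernel SYN Cookie path */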

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240115205514.68364-5-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:24 -08:00
Kuniyuki Iwashima
8b5ac68fb5 bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock().
We will support arbitrary SYN Cookie with BPF.

If the BPF prog validates the ACK and the kfunc allocates a reqsk, it
will be carried to the TCP stack as skb->sk with req->syncookie set
to 1.  Also, the reqsk has its listener as req->rsk_listener, with no
refcnt taken.

When the TCP stack looks up a socket from the skb, we steal
inet_reqsk(skb->sk)->rsk_listener in skb_steal_sock() so that
the skb will be processed in cookie_v[46]_check() with the
listener.

Note that we do not clear skb->sk and skb->destructor so that we
can carry the reqsk to cookie_v[46]_check().
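
A hedged sketch of the resulting branch (signature as introduced earlier
in this series; details condensed):

    static inline struct sock *skb_steal_sock(struct sk_buff *skb,
                                              bool *refcounted,
                                              bool *prefetched)
    {
        struct sock *sk = skb->sk;

        if (!sk) {
            *prefetched = false;
            *refcounted = false;
            return NULL;
        }

        *prefetched = skb_sk_is_prefetched(skb);
        if (*prefetched) {
            if (sk->sk_state == TCP_NEW_SYN_RECV &&
                inet_reqsk(sk)->syncookie) {
                /* reqsk from the kfunc: return the listener and keep
                 * skb->sk/skb->destructor so the reqsk reaches
                 * cookie_v[46]_check()
                 */
                *refcounted = false;
                return inet_reqsk(sk)->rsk_listener;
            }
            *refcounted = sk_is_refcounted(sk);
        } else {
            *refcounted = true;
        }

        skb->destructor = NULL;
        skb->sk = NULL;
        return sk;
    }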

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240115205514.68364-4-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:24 -08:00
Artem Savkov
d177c1be06 selftests/bpf: Fix potential premature unload in bpf_testmod
It is possible for bpf_kfunc_call_test_release() to be called from
bpf_map_free_deferred() when bpf_testmod is already unloaded and
prog_test_struct.cnt, which it tries to decrease, is no longer in memory.
This patch fixes the issue by waiting for all references to be
dropped in bpf_testmod_exit().
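
A hedged sketch of the wait (struct name per the description above; the
polling interval is illustrative):

    /* in bpf_testmod_exit(): don't let the module go away while
     * deferred releases may still touch prog_test_struct.cnt
     */
    while (refcount_read(&prog_test_struct.cnt) > 1)
        msleep(20);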

The issue can be triggered by running 'test_progs -t map_kptr' in 6.5,
but is obscured in 6.6 by d119357d07 ("rcu-tasks: Treat only
synchronous grace periods urgently").

Fixes: 65eb006d85 ("bpf: Move kernel test kfuncs to bpf_testmod")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/82f55c0e-0ec8-4fe1-8d8c-b1de07558ad9@linux.dev
Link: https://lore.kernel.org/bpf/20240110085737.8895-1-asavkov@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Kuniyuki Iwashima
95e752b529 tcp: Move skb_steal_sock() to request_sock.h
We will support arbitrary SYN Cookie with BPF.

If the BPF prog validates the ACK and the kfunc allocates a reqsk, it
will be carried to the TCP stack as skb->sk with req->syncookie set to 1.

In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie
to see if the reqsk is created by kfunc.  However, inet_reqsk()
is not available in sock.h.

Let's move skb_steal_sock() to request_sock.h.

While at it, we refactor skb_steal_sock() so that it returns early if
skb->sk is NULL, to minimise the diff in the following patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240115205514.68364-3-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Tiezhu Yang
f5f30386c7 bpftool: Silence build warning about calloc()
The following warning appears when building bpftool:

  CC      prog.o
prog.c: In function ‘profile_open_perf_events’:
prog.c:2301:24: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
 2301 |                 sizeof(int), obj->rodata->num_cpu * obj->rodata->num_metric);
      |                        ^~~
prog.c:2301:24: note: earlier argument should specify number of elements, later size of each element

This was seen with the latest upstream GCC, which contains the new
warning option -Wcalloc-transposed-args.  The first argument to calloc()
is documented to be the number of elements in the array, while the second
argument is the size of each element, so just switch the first and second
arguments of calloc() to silence the build warning.  Compile tested only.
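
The fix, sketched (destination variable name illustrative; the argument
expressions are from the warning above):

    /* before: arguments transposed */
    fds = calloc(sizeof(int), obj->rodata->num_cpu * obj->rodata->num_metric);
    /* after: number of elements first, element size second */
    fds = calloc(obj->rodata->num_cpu * obj->rodata->num_metric, sizeof(int));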

Fixes: 47c09d6a9f ("bpftool: Introduce "prog profile" command")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240116061920.31172-1-yangtiezhu@loongson.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Kuniyuki Iwashima
b18afb6f42 tcp: Move tcp_ns_to_ts() to tcp.h
We will support arbitrary SYN Cookie with BPF.

When the BPF prog validates the ACK and the kfunc allocates a reqsk, we
need to call tcp_ns_to_ts() to calculate an offset of TSval for later
use:

  time
  t0 : Send SYN+ACK
       -> tsval = Initial TSval (Random Number)

  t1 : Recv ACK of 3WHS
       -> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns())
                = Initial TSval - t1

  t2 : Send ACK
       -> tsval = t2 + tsoff
                = Initial TSval + (t2 - t1)
                = Initial TSval + Time Delta (x)

  (x) Note that the time delta does not include the initial RTT
      from t0 to t1.

Let's move tcp_ns_to_ts() to tcp.h.
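
For reference, a hedged sketch of the helper's shape (implementation
details assumed):

    static inline u32 tcp_ns_to_ts(bool usec_ts, u64 ns)
    {
        if (usec_ts)
            return div_u64(ns, NSEC_PER_USEC);        /* 1 MHz TS clock */

        return div_u64(ns, NSEC_PER_SEC / TCP_TS_HZ); /* 1 kHz TS clock */
    }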

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240115205514.68364-2-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Alexei Starovoitov
49c06547d5 bpf: Minor improvements for bpf_cmp.
A few minor improvements for the bpf_cmp() macro:
. reduce number of args in __bpf_cmp()
. rename NOFLIP to UNLIKELY
. add a comment about 64-bit truncation in "i" constraint
. use "ri" constraint for sizeof(rhs) <= 4
. improve error message for bpf_cmp_likely()

Before:
progs/iters_task_vma.c:31:7: error: variable 'ret' is uninitialized when used here [-Werror,-Wuninitialized]
   31 |                 if (bpf_cmp_likely(seen, <==, 1000))
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../bpf/bpf_experimental.h:325:3: note: expanded from macro 'bpf_cmp_likely'
  325 |                 ret;
      |                 ^~~
progs/iters_task_vma.c:31:7: note: variable 'ret' is declared here
../bpf/bpf_experimental.h:310:3: note: expanded from macro 'bpf_cmp_likely'
  310 |                 bool ret;
      |                 ^

After:
progs/iters_task_vma.c:31:7: error: invalid operand for instruction
   31 |                 if (bpf_cmp_likely(seen, <==, 1000))
      |                     ^
../bpf/bpf_experimental.h:324:17: note: expanded from macro 'bpf_cmp_likely'
  324 |                         asm volatile("r0 " #OP " invalid compare");
      |                                      ^
<inline asm>:1:5: note: instantiated into assembly here
    1 |         r0 <== invalid compare
      |            ^

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240112220134.71209-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Yonghong Song
88031b929c docs/bpf: Fix an incorrect statement in verifier.rst
In verifier.rst, I found an incorrect statement (maybe a typo) in the
section 'Liveness marks tracking'. Basically, a read mark is attributed
to the wrong register. This may confuse the user.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240111052136.3440417-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Yonghong Song
6ae99ac8b7 selftests/bpf: Add a selftest with not-8-byte aligned BPF_ST
Add a selftest with a 4-byte BPF_ST of 0 where the store is not
8-byte aligned.  The goal is to ensure that STACK_ZERO is properly
marked in stack slots and that the STACK_ZERO value can propagate
properly during the load.
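
The exercised pattern, sketched (offset illustrative):

  *(u32 *)(r10 - 5) = 0   /* 4-byte store of 0, not 8-byte aligned */
  r1 = *(u32 *)(r10 - 5)  /* the load must see the STACK_ZERO marks */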

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240110051355.2737232-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Yonghong Song
9a4c57f52b bpf: Track aligned st store as imprecise spilled registers
With patch set [1], precision backtracing supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case on cpuv3 and lower for
initializing stack variables with the pattern
  r1 = 0
  *(u64 *)(r10 - 8) = r1
and the [2] has demonstrated good verification improvement.

For cpuv4, the initialization could be
  *(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like an
imprecise register spill, which can reduce the number of verified states.
The change is in function check_stack_write_fixed_off().

Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.

The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots hold const
zero and the BPF_ST also writes const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.

I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.

For cpuv4:
$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File                                        Program              Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
------------------------------------------  -------------------  ---------  ---------  ---------------  ----------  ----------  -------------
local_storage_bench.bpf.linked3.o           get_local                  228        168    -60 (-26.32%)          17          14   -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o            on_event                  6066       4889  -1177 (-19.40%)         403         321  -82 (-20.35%)
test_cls_redirect.bpf.linked3.o             cls_redirect             35483      35387     -96 (-0.27%)        2179        2177    -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o            balancer_ingress          4494       4522     +28 (+0.62%)         217         219    +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o     balancer_ingress          1432       1455     +23 (+1.61%)          92          94    +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o             balancer_ingress_v6       3462       3458      -4 (-0.12%)         216         216    +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o  widening                    52         41    -11 (-21.15%)           4           3   -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_tc             12412      11719    -693 (-5.58%)         345         330   -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_xdp            12478      11794    -684 (-5.48%)         346         331   -15 (-4.34%)

test_l4lb_noinline and test_l4lb_noinline_dynptr have minor regressions, but
pyperf600_bpf_loop and local_storage_bench get pretty good improvements.

  [1] https://lore.kernel.org/all/20231205184248.1502704-1-andrii@kernel.org/
  [2] https://lore.kernel.org/all/20231205184248.1502704-9-andrii@kernel.org/

Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240110051348.2737007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy
3893f0b6a0 selftests/bpf: Test assigning ID to scalars on spill
The previous commit implemented assigning IDs to registers holding
scalars before spill. Add the test cases to check the new functionality.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-10-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy
8ecfc371d8 bpf: Assign ID to scalars on spill
Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if it was already assigned, i.e. if this
register was MOVed before.

Assign an ID on spill if none is set, so that equal scalars can be
tracked if a register is spilled to the stack and filled into another
register.
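
The newly tracked pattern, sketched:

  r6 = ... bounded scalar without an ID ...
  *(u64 *)(r10 - 8) = r6   /* an ID is now assigned on this spill */
  r7 = *(u64 *)(r10 - 8)   /* fill: r7 shares that ID */
  if r6 > 10 goto exit     /* the range check narrows r7 as well */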

One test is adjusted to reflect the change in register IDs.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-9-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy
87e51ac6cb bpf: Add the get_reg_width function
Put calculation of the register value width into a dedicated function.
This function will also be used in a following commit.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Link: https://lore.kernel.org/r/20240108205209.838365-8-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy
8e0e074aaf bpf: Add the assign_scalar_id_before_mov function
Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-7-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Maxim Mikityanskiy
b827eee4c4 selftests/bpf: Add a test case for 32-bit spill tracking
When a range check is performed on a register that was 32-bit spilled to
the stack, the IDs of the two instances of the register are the same, so
the range should also be the same.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-6-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Maxim Mikityanskiy
32f55dd4ad bpf: Make bpf_for_each_spilled_reg consider narrow spills
Adjust the check in bpf_get_spilled_reg to take into account spilled
registers narrower than 64 bits. That allows find_equal_scalars to
properly adjust the range of all spilled registers that have the same
ID. Before this change, it was possible for a register and a spilled
register to have the same IDs but different ranges if the spill was
narrower than 64 bits and a range check was performed on the register.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-5-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Eduard Zingerman
c035b3e555 selftests/bpf: check if imprecise stack spills confuse infinite loop detection
Verify that infinite loop detection logic separates states with
identical register states but different imprecise scalars spilled to
stack.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-4-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Eduard Zingerman
d5b892fd60 bpf: make infinite loop detection in is_state_visited() exact
The current infinite loop detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp
  for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false
  states_equal() would compare scalars for equality only if in old
  state scalar has precision mark.

Such logic might be problematic if compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:

        r0 = ... unknown scalar ...
        r0 &= 0xff;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
    loop:
        r0 = *(u64 *)(r10 - 8);
        if r0 > 10 goto exit_;
        r0 += 1;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
        goto loop;

This commit updates call to states_equal to use exact=true, forcing
all scalar comparisons to be exact.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-3-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Maxim Mikityanskiy
242d185141 selftests/bpf: Fix the u64_offset_to_skb_data test
The u64_offset_to_skb_data test is supposed to make a 64-bit fill, but
instead makes a 16-bit one. Fix the test according to its intention and
update the comments accordingly (umax is no longer 0xffff). The 16-bit
fill is covered by u16_offset_to_skb_data.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-2-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Nathan Chancellor
f067074baf selftests/bpf: Update LLVM Phabricator links
reviews.llvm.org was LLVM's Phabricator instance for code review.  It
has been abandoned in favor of GitHub pull requests.  While the majority
of links in the kernel sources still work because of the work Fangrui
has done turning the dynamic Phabricator instance into a static archive,
there are some issues with that work, so preemptively convert all the
links in the kernel sources to point to the commit on GitHub.

Most of the commits have the corresponding differential review link in
the commit message itself so there should not be any loss of fidelity in
the relevant information.

Additionally, fix a typo in the xdpwall.c print ("LLMV" -> "LLVM") while
in the area.

Link: https://discourse.llvm.org/t/update-on-github-pull-requests/71540/172
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240111-bpf-update-llvm-phabricator-links-v2-1-9a7ae976bd64@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Andrii Nakryiko
56d3e44af8 selftests/bpf: detect testing prog flags support
Various tests specify extra testing prog_flags when loading BPF
programs, like BPF_F_TEST_RND_HI32, and more recently also
BPF_F_TEST_REG_INVARIANTS. While BPF_F_TEST_RND_HI32 is old enough to
not cause much of a problem on older kernels, BPF_F_TEST_REG_INVARIANTS is
very fresh and unconditionally specifying it causes selftests to fail on
even slightly outdated kernels.

This breaks libbpf CI test against 4.9 and 5.15 kernels, it can break
some local development (done outside of VM), etc.

To prevent this, and to guard against similar problems in the future, do
runtime detection of supported "testing flags", and only pass along those
that the host kernel recognizes.
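
A hedged sketch of the probing idea (function name illustrative; libbpf
1.x API assumed):

    static bool kernel_supports_prog_flag(__u32 flag)
    {
        const struct bpf_insn insns[] = {
            BPF_MOV64_IMM(BPF_REG_0, 0),
            BPF_EXIT_INSN(),
        };
        LIBBPF_OPTS(bpf_prog_load_opts, opts, .prog_flags = flag);
        int fd;

        /* a trivial program: only the flag can make the load fail */
        fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL",
                           insns, ARRAY_SIZE(insns), &opts);
        if (fd >= 0)
            close(fd);
        return fd >= 0;
    }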

Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240109231738.575844-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Dave Thaler
81777efbf5 Introduce concept of conformance groups
The discussion of what the actual conformance groups should be
is still in progress, so this is just part 1 which only uses
"legacy" for deprecated instructions and "basic" for everything
else.  Subsequent patches will add more groups as discussion
continues.

Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240108214231.5280-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Randy Dunlap
15b8b0be98 net: filter: fix spelling mistakes
Fix spelling errors as reported by codespell.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: bpf@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240106065545.16855-1-rdunlap@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Andrii Nakryiko
522bb2c1f8 bpf: support multiple tags per argument
Add the ability to iterate multiple decl_tag types pointing to the same
function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.

We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko
54c11ec493 bpf: prepare btf_prepare_func_args() for multiple tags per argument
Add btf_arg_tag flags enum to be able to record multiple tags per
argument. Also streamline pointer argument processing some more.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko
18810ad392 bpf: make sure scalar args don't accept __arg_nonnull tag
Move scalar arg processing in btf_prepare_func_args() to after all
pointer arg processing is done. This makes it easier to do validation.
One example of unintended behavior right now is the ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.
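
An illustrative example of what is now rejected (tag placement per the
selftests' conventions; hypothetical subprog):

    /* __arg_nonnull only makes sense for pointer arguments; applying
     * it to a scalar like this used to be silently accepted
     */
    __noinline int subprog(int x __arg_nonnull)
    {
        return x;
    }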

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko
e31f98c1af selftests/bpf: fix test_loader check message
Seeing:

  process_subtest:PASS:Can't alloc specs array 0 nsec

... in verbose successful test log is very confusing. Use smaller
identifier-like test tag to denote that we are asserting specs array
allocation success.

Now it's much less distracting:

  process_subtest:PASS:specs_alloc 0 nsec

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Alexei Starovoitov
55c14321db Merge branch 'bpf-inline-bpf_kptr_xchg'
Hou Tao says:

====================
The motivation for inlining bpf_kptr_xchg() comes from the performance
profiling of the bpf memory allocator benchmark [1]. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inlining bpf_kptr_xchg(), the performance for
object free on an 8-CPU VM increases by about 2%~10%. However the
performance gain comes with costs: both the kasan and kcsan checks on
the pointer will be unavailable. Initially the inline was implemented in
do_jit() for x86-64 directly, but I think it will be more portable to
implement the inline in the verifier.

Patch #1 supports inlining the bpf_kptr_xchg() helper and enables it on
x86-64. Patch #2 factors out a helper for the newly-added test in patch #3.
Patch #3 tests whether the inlining of bpf_kptr_xchg() is expected.
Please see individual patches for more details. And comments are always
welcome.

Change Log:
v3:
  * rebased on bpf-next tree
  * patch 1 & 2: Add Rvb-by and Ack-by tags from Eduard
  * patch 3: use inline assembly and naked function instead of c code
             (suggested by Eduard)

v2: https://lore.kernel.org/bpf/20231223104042.1432300-1-houtao@huaweicloud.com/
  * rebased on bpf-next tree
  * drop patch #1 in v1 due to discussion in [2]
  * patch #1: add the motivation in the commit message, merge patch #1
              and #3 into the new patch in v2. (Daniel)
  * patch #2/#3: newly-added patch to test the inlining of
                 bpf_kptr_xchg() (Eduard)

v1: https://lore.kernel.org/bpf/95b8c2cd-44d5-5fe1-60b5-7e8218779566@huaweicloud.com/

[1]: https://lore.kernel.org/bpf/20231221141501.3588586-1-houtao@huaweicloud.com/
[2]: https://lore.kernel.org/bpf/fd94efb9-4a56-c982-dc2e-c66be5202cb7@huaweicloud.com/
====================

Link: https://lore.kernel.org/r/20240105104819.3916743-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Hou Tao
17bda53e43 selftests/bpf: Test the inlining of bpf_kptr_xchg()
The test uses bpf_prog_get_info_by_fd() to obtain the xlated
instructions of the program first. Since these instructions have
already been rewritten by the verifier, the test then checks whether
the rewritten instructions are as expected. And to ensure LLVM generates
code exactly as expected, use inline assembly and a naked function.

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Hou Tao
b4b7a4099b selftests/bpf: Factor out get_xlated_program() helper
Both test_verifier and test_progs use get_xlated_program(), so move
the helper into testing_helpers.h so both can reuse it.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Hou Tao
7c05e7f3e7 bpf: Support inlining bpf_kptr_xchg() helper
The motivation for inlining bpf_kptr_xchg() comes from the performance
profiling of the bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inlining bpf_kptr_xchg(), the performance for
object free on an 8-CPU VM increases by about 2%~10%. The inline also
has a downside: both the kasan and kcsan checks on the pointer will be
unavailable.

bpf_kptr_xchg() can be inlined by converting the call to
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) the JIT backend supports atomic_xchg() on pointer-sized words
2) for the specific arch, the implementation of xchg is the same as
   atomic_xchg() on pointer-sized words

It seems most 64-bit JIT backends satisfy these two conditions. But
as a precaution, define a weak function bpf_jit_supports_ptr_xchg()
to state whether such a conversion is safe, and only support inlining
on 64-bit hosts.

For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.
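
A hedged sketch of the verifier rewrite (instruction macros per the
uapi headers; the exact sequence is assumed):

    /* replace the helper call with: r0 = r2; r0 = xchg(r1 + 0, r0) */
    insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_2);
    insn_buf[1] = BPF_ATOMIC_OP(BPF_DW, BPF_XCHG, BPF_REG_1, BPF_REG_0, 0);
    cnt = 2;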

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Paolo Abeni
2121c43f88 Merge branch 'inet_diag-remove-three-mutexes-in-diag-dumps'
Eric Dumazet says:

====================
inet_diag: remove three mutexes in diag dumps

Surprisingly, inet_diag operations are serialized over a stack
of three mutexes, giving legacy /proc based files an unfair
advantage on modern hosts.

This series removes all of them, making inet_diag operations
(e.g. iproute2/ss) fully parallel.

1-2) Two first patches are adding data-race annotations
     and can be backported to stable kernels.

3-4) inet_diag_table_mutex can be replaced with RCU protection,
     if we add corresponding protection against module unload.

5-7) sock_diag_table_mutex can be replaced with RCU protection,
     if we add corresponding protection against module unload.

 8)  sock_diag_mutex is removed, as the old bug it was
     working around has been fixed more elegantly.

 9)  inet_diag_dump_icsk() can skip over empty buckets to reduce
     spinlock contention.
====================

Link: https://lore.kernel.org/r/20240122112603.3270097-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:18:43 +01:00
Eric Dumazet
622a08e8de inet_diag: skip over empty buckets
After the removal of inet_diag_table_mutex, sock_diag_table_mutex
and sock_diag_mutex, I was able to see spinlock contention from
inet_diag_dump_icsk() when running 100 parallel invocations.

It is time to skip over empty buckets.
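
The skip, sketched (bucket field names assumed):

    /* in inet_diag_dump_icsk(): don't touch the bucket lock at all
     * when there is nothing to dump
     */
    if (hlist_nulls_empty(&head->chain))
        continue;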

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:55 +01:00
Eric Dumazet
f44e64990b sock_diag: remove sock_diag_mutex
sock_diag_rcv() is still serializing its operations using
a mutex, for no good reason.

This came with commit 0a9c730144 ("[INET_DIAG]: Fix oops
in netlink_rcv_skb"), but the root cause has been fixed
with commit cd40b7d398 ("[NET]: make netlink user -> kernel
interface synchronious").

Remove this mutex to let multiple threads run concurrently.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:55 +01:00
Eric Dumazet
86e8921df0 sock_diag: allow concurrent operation in sock_diag_rcv_msg()
TCPDIAG_GETSOCK and DCCPDIAG_GETSOCK diag dumps are serialized
on sock_diag_table_mutex.

This is to make sure the inet_diag module is not unloaded
while a diag dump is ongoing.

It is time to get rid of this mutex and use RCU protection,
allowing full parallelism.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:55 +01:00
Eric Dumazet
1d55a69747 sock_diag: allow concurrent operations
sock_diag_broadcast_destroy_work() and __sock_diag_cmd()
are currently using sock_diag_table_mutex to protect
against concurrent sock_diag_handlers[] changes.

This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.

It is time to switch to full RCU protection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:55 +01:00
Eric Dumazet
114b4bb1cc sock_diag: add module pointer to "struct sock_diag_handler"
The following patch is going to use RCU instead of
sock_diag_table_mutex acquisition.

This patch is a preparation, no change of behavior yet.
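
A hedged sketch of the preparation (comment reflects the intended use
in the follow-up patches):

    struct sock_diag_handler {
        struct module *owner; /* new: lets RCU readers pin the module
                               * with try_module_get() before use
                               */
        __u8 family;
        /* ... existing callbacks unchanged ... */
    };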

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:54 +01:00
Eric Dumazet
223f55196b inet_diag: allow concurrent operations
inet_diag_lock_handler() current implementation uses a mutex
to protect inet_diag_table[] array against concurrent changes.

This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.

It is time to switch to full RCU protection.

As a bonus, if a target is statically linked instead of being
modular, inet_diag_lock_handler() & inet_diag_unlock_handler()
reduce to reads only.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:54 +01:00
Eric Dumazet
db5914695a inet_diag: add module pointer to "struct inet_diag_handler"
The following patch is going to use RCU instead of
inet_diag_table_mutex acquisition.

This patch is a preparation, no change of behavior yet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:54 +01:00
Eric Dumazet
e50e10ae5d inet_diag: annotate data-races around inet_diag_table[]
inet_diag_lock_handler() reads inet_diag_table[proto] locklessly.

Use READ_ONCE()/WRITE_ONCE() annotations to avoid potential issues.
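
The annotation pattern, sketched:

    /* lockless reader */
    handler = READ_ONCE(inet_diag_table[proto]);

    /* registration / unregistration sides */
    WRITE_ONCE(inet_diag_table[type], h);
    WRITE_ONCE(inet_diag_table[type], NULL);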

Fixes: d523a328fb ("[INET]: Fix inet_diag dead-lock regression")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23 15:13:54 +01:00