Commit Graph

39299 Commits

Author SHA1 Message Date
Daniel Müller
ec6209c8d4 bpf, libbpf: Add type match support
This patch adds support for the proposed type match relation to
relo_core where it is shared between userspace and kernel. It plumbs
through both kernel-side and libbpf-side support.

The matching relation is defined as follows (copy from source):
- modifiers and typedefs are stripped (and, hence, effectively ignored)
- generally speaking types need to be of same kind (struct vs. struct, union
  vs. union, etc.)
  - exceptions are struct/union behind a pointer which could also match a
    forward declaration of a struct or union, respectively, and enum vs.
    enum64 (see below)
Then, depending on type:
- integers:
  - match if size and signedness match
- arrays & pointers:
  - target types are recursively matched
- structs & unions:
  - local members need to exist in target with the same name
  - for each member we recursively check match unless it is already behind a
    pointer, in which case we only check matching names and compatible kind
- enums:
  - local variants have to have a match in target by symbolic name (but not
    numeric value)
  - size has to match (but enum may match enum64 and vice versa)
- function pointers:
  - number and position of arguments in local type has to match target
  - for each argument and the return value we recursively check match

Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220628160127.607834-5-deso@posteo.net
2022-07-05 21:14:25 -07:00
Stanislav Fomichev
9113d7e48e bpf: expose bpf_{g,s}etsockopt to lsm cgroup
I don't see how to make it nice without introducing btf id lists
for the hooks where these helpers are allowed. Some LSM hooks
work on the locked sockets, some are triggering early and
don't grab any locks, so have two lists for now:

1. LSM hooks which trigger under socket lock - minority of the hooks,
   but ideal case for us, we can expose existing BTF-based helpers
2. LSM hooks which trigger without socket lock, but they trigger
   early in the socket creation path where it should be safe to
   do setsockopt without any locks
3. The rest are prohibited. I'm thinking that this use-case might
   be a good gateway to sleeping lsm cgroup hooks in the future.
   We can either expose lock/unlock operations (and add tracking
   to the verifier) or have another set of bpf_setsockopt
   wrapper that grab the locks and might sleep.

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:52 -07:00
Stanislav Fomichev
b79c9fc955 bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
We have two options:
1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point

I was doing (2) in the original patch, but switching to (1) here:

* bpf_prog_query returns all attached BPF_LSM_CGROUP programs
regardless of attach_btf_id
* attach_btf_id is exported via bpf_prog_info

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:52 -07:00
Stanislav Fomichev
c0e19f2c9a bpf: minimize number of allocated lsm slots per program
Previous patch adds 1:1 mapping between all 211 LSM hooks
and bpf_cgroup program array. Instead of reserving a slot per
possible hook, reserve 10 slots per cgroup for lsm programs.
Those slots are dynamically allocated on demand and reclaimed.

struct cgroup_bpf {
	struct bpf_prog_array *    effective[33];        /*     0   264 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	struct hlist_head          progs[33];            /*   264   264 */
	/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
	u8                         flags[33];            /*   528    33 */

	/* XXX 7 bytes hole, try to pack */

	struct list_head           storages;             /*   568    16 */
	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
	struct bpf_prog_array *    inactive;             /*   584     8 */
	struct percpu_ref          refcnt;               /*   592    16 */
	struct work_struct         release_work;         /*   608    72 */

	/* size: 680, cachelines: 11, members: 7 */
	/* sum members: 673, holes: 1, sum holes: 7 */
	/* last cacheline: 40 bytes */
};

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:52 -07:00
Stanislav Fomichev
69fd337a97 bpf: per-cgroup lsm flavor
Allow attaching to lsm hooks in the cgroup context.

Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.

For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.

Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.

Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.

Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:51 -07:00
Stanislav Fomichev
00442143a2 bpf: convert cgroup_bpf.progs to hlist
This lets us reclaim some space to be used by new cgroup lsm slots.

Before:
struct cgroup_bpf {
	struct bpf_prog_array *    effective[23];        /*     0   184 */
	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
	struct list_head           progs[23];            /*   184   368 */
	/* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
	u32                        flags[23];            /*   552    92 */

	/* XXX 4 bytes hole, try to pack */

	/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
	struct list_head           storages;             /*   648    16 */
	struct bpf_prog_array *    inactive;             /*   664     8 */
	struct percpu_ref          refcnt;               /*   672    16 */
	struct work_struct         release_work;         /*   688    32 */

	/* size: 720, cachelines: 12, members: 7 */
	/* sum members: 716, holes: 1, sum holes: 4 */
	/* last cacheline: 16 bytes */
};

After:
struct cgroup_bpf {
	struct bpf_prog_array *    effective[23];        /*     0   184 */
	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
	struct hlist_head          progs[23];            /*   184   184 */
	/* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
	u8                         flags[23];            /*   368    23 */

	/* XXX 1 byte hole, try to pack */

	/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
	struct list_head           storages;             /*   392    16 */
	struct bpf_prog_array *    inactive;             /*   408     8 */
	struct percpu_ref          refcnt;               /*   416    16 */
	struct work_struct         release_work;         /*   432    72 */

	/* size: 504, cachelines: 8, members: 7 */
	/* sum members: 503, holes: 1, sum holes: 1 */
	/* last cacheline: 56 bytes */
};

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:51 -07:00
Stanislav Fomichev
af3f413400 bpf: add bpf_func_t and trampoline helpers
I'll be adding lsm cgroup specific helpers that grab
trampoline mutex.

No functional changes.

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:51 -07:00
Daniel Müller
fd75733da2 bpf: Merge "types_are_compat" logic into relo_core.c
BPF type compatibility checks (bpf_core_types_are_compat()) are
currently duplicated between kernel and user space. That's a historical
artifact more than intentional doing and can lead to subtle bugs where
one implementation is adjusted but another is forgotten.

That happened with the enum64 work, for example, where the libbpf side
was changed (commit 23b2a3a8f6 ("libbpf: Add enum64 relocation
support")) to use the btf_kind_core_compat() helper function but the
kernel side was not (commit 6089fb325c ("bpf: Add btf enum64
support")).

This patch addresses both the duplication issue, by merging both
implementations and moving them into relo_core.c, and fixes the alluded
to kind check (by giving preference to libbpf's already adjusted logic).

For discussion of the topic, please refer to:
https://lore.kernel.org/bpf/CAADnVQKbWR7oarBdewgOBZUPzryhRYvEbkhyPJQHHuxq=0K1gw@mail.gmail.com/T/#mcc99f4a33ad9a322afaf1b9276fb1f0b7add9665

Changelog:
v1 -> v2:
- limited libbpf recursion limit to 32
- changed name to __bpf_core_types_are_compat
- included warning previously present in libbpf version
- merged kernel and user space changes into a single patch

Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220623182934.2582827-1-deso@posteo.net
2022-06-24 14:15:37 -07:00
Eduard Zingerman
fb4e3b33e3 bpf: Fix for use-after-free bug in inline_bpf_loop
As reported by Dan Carpenter, the following statements in inline_bpf_loop()
might cause a use-after-free bug:

  struct bpf_prog *new_prog;
  // ...
  new_prog = bpf_patch_insn_data(env, position, insn_buf, *cnt);
  // ...
  env->prog->insnsi[call_insn_offset].imm = callback_offset;

The bpf_patch_insn_data() might free the memory used by env->prog.

Fixes: 1ade237119 ("bpf: Inline calls to bpf_loop when callback is known")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220624020613.548108-2-eddyz87@gmail.com
2022-06-24 16:50:39 +02:00
Simon Wang
395e942d34 bpf: Replace hard-coded 0 with BPF_K in check_alu_op
Enhance readability a bit.

Signed-off-by: Simon Wang <wangchuanguo@inspur.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220622031923.65692-1-wangchuanguo@inspur.com
2022-06-24 15:40:42 +02:00
Jörn-Thorben Hinz
9f0265e921 bpf: Require only one of cong_avoid() and cong_control() from a TCP CC
Remove the check for required and optional functions in a struct
tcp_congestion_ops from bpf_tcp_ca.c. Rely on
tcp_register_congestion_control() to reject a BPF CC that does not
implement all required functions, as it will do for a non-BPF CC.

When a CC implements tcp_congestion_ops.cong_control(), the alternate
cong_avoid() is not in use in the TCP stack. Previously, a BPF CC was
still forced to implement cong_avoid() as a no-op since it was
non-optional in bpf_tcp_ca.c.

Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220622191227.898118-3-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23 09:49:57 -07:00
Tony Ambardar
95acd8817e bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT
The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail
calls for only x86-64. Change the logic to instead rely on a new weak
function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable
JIT backend can override.

Update the x86-64 eBPF JIT to reflect this.

Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
[jakub: drop MIPS bits and tweak patch subject]
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com
2022-06-21 18:52:04 +02:00
Eduard Zingerman
1ade237119 bpf: Inline calls to bpf_loop when callback is known
Calls to `bpf_loop` are replaced with direct loops to avoid
indirection. E.g. the following:

  bpf_loop(10, foo, NULL, 0);

Is replaced by equivalent of the following:

  for (int i = 0; i < 10; ++i)
    foo(i, NULL);

This transformation could be applied when:
- callback is known and does not change during program execution;
- flags passed to `bpf_loop` are always zero.

Inlining logic works as follows:

- During execution simulation function `update_loop_inline_state`
  tracks the following information for each `bpf_loop` call
  instruction:
  - is callback known and constant?
  - are flags constant and zero?
- Function `optimize_bpf_loop` increases stack depth for functions
  where `bpf_loop` calls can be inlined and invokes `inline_bpf_loop`
  to apply the inlining. The additional stack space is used to spill
  registers R6, R7 and R8. These registers are used as loop counter,
  loop maximal bound and callback context parameter;

Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on
i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-20 17:40:51 -07:00
Delyan Kratunov
aca80dd95e uprobe: gate bpf call behind BPF_EVENTS
The call into bpf from uprobes needs to be gated now that it doesn't use
the trace_events.h helpers.

Randy found this as a randconfig build failure on linux-next [1].

  [1]: https://lore.kernel.org/linux-next/2de99180-7d55-2fdf-134d-33198c27cc58@infradead.org/

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/cb8bfbbcde87ed5d811227a393ef4925f2aadb7b.camel@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-20 17:04:03 -07:00
Jakub Kicinski
9fb424c4c2 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-06-17

We've added 72 non-merge commits during the last 15 day(s) which contain
a total of 92 files changed, 4582 insertions(+), 834 deletions(-).

The main changes are:

1) Add 64 bit enum value support to BTF, from Yonghong Song.

2) Implement support for sleepable BPF uprobe programs, from Delyan Kratunov.

3) Add new BPF helpers to issue and check TCP SYN cookies without binding to a
   socket especially useful in synproxy scenarios, from Maxim Mikityanskiy.

4) Fix libbpf's internal USDT address translation logic for shared libraries as
   well as uprobe's symbol file offset calculation, from Andrii Nakryiko.

5) Extend libbpf to provide an API for textual representation of the various
   map/prog/attach/link types and use it in bpftool, from Daniel Müller.

6) Provide BTF line info for RV64 and RV32 JITs, and fix a put_user bug in the
   core seen in 32 bit when storing BPF function addresses, from Pu Lehui.

7) Fix libbpf's BTF pointer size guessing by adding a list of various aliases
   for 'long' types, from Douglas Raillard.

8) Fix bpftool to readd setting rlimit since probing for memcg-based accounting
   has been unreliable and caused a regression on COS, from Quentin Monnet.

9) Fix UAF in BPF cgroup's effective program computation triggered upon BPF link
   detachment, from Tadeusz Struk.

10) Fix bpftool build bootstrapping during cross compilation which was pointing
    to the wrong AR process, from Shahab Vahedi.

11) Fix logic bug in libbpf's is_pow_of_2 implementation, from Yuze Chi.

12) BPF hash map optimization to avoid grabbing spinlocks of all CPUs when there
    is no free element. Also add a benchmark as reproducer, from Feng Zhou.

13) Fix bpftool's codegen to bail out when there's no BTF, from Michael Mullin.

14) Various minor cleanup and improvements all over the place.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
  bpf: Fix bpf_skc_lookup comment wrt. return type
  bpf: Fix non-static bpf_func_proto struct definitions
  selftests/bpf: Don't force lld on non-x86 architectures
  selftests/bpf: Add selftests for raw syncookie helpers in TC mode
  bpf: Allow the new syncookie helpers to work with SKBs
  selftests/bpf: Add selftests for raw syncookie helpers
  bpf: Add helpers to issue and check SYN cookies in XDP
  bpf: Allow helpers to accept pointers with a fixed size
  bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
  selftests/bpf: add tests for sleepable (uk)probes
  libbpf: add support for sleepable uprobe programs
  bpf: allow sleepable uprobe programs to attach
  bpf: implement sleepable uprobes by chaining gps
  bpf: move bpf_prog to bpf.h
  libbpf: Fix internal USDT address translation logic for shared libraries
  samples/bpf: Check detach prog exist or not in xdp_fwd
  selftests/bpf: Avoid skipping certain subtests
  selftests/bpf: Fix test_varlen verification failure with latest llvm
  bpftool: Do not check return value from libbpf_set_strict_mode()
  Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"
  ...
====================

Link: https://lore.kernel.org/r/20220617220836.7373-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-17 19:35:19 -07:00
Joanne Koong
dc368e1c65 bpf: Fix non-static bpf_func_proto struct definitions
This patch does two things:

1) Marks the dynptr bpf_func_proto structs that were added in [1]
   as static, as pointed out by the kernel test robot in [2].

2) There are some bpf_func_proto structs marked as extern which can
   instead be statically defined.

  [1] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
  [2] https://lore.kernel.org/bpf/62ab89f2.Pko7sI08RAKdF8R6%25lkp@intel.com/

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220616225407.1878436-1-joannelkoong@gmail.com
2022-06-17 16:00:51 +02:00
Maxim Mikityanskiy
508362ac66 bpf: Allow helpers to accept pointers with a fixed size
Before this commit, the BPF verifier required ARG_PTR_TO_MEM arguments
to be followed by ARG_CONST_SIZE holding the size of the memory region.
The helpers had to check that size in runtime.

There are cases where the size expected by a helper is a compile-time
constant. Checking it in runtime is an unnecessary overhead and waste of
BPF registers.

This commit allows helpers to accept pointers to memory without the
corresponding ARG_CONST_SIZE, given that they define the memory region
size in struct bpf_func_proto and use ARG_PTR_TO_FIXED_SIZE_MEM type.

arg_size is unionized with arg_btf_id to reduce the kernel image size,
and it's valid because they are used by different argument types.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-3-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 21:20:29 -07:00
Delyan Kratunov
64ad7556c7 bpf: allow sleepable uprobe programs to attach
uprobe and kprobe programs have the same program type, KPROBE, which is
currently not allowed to load sleepable programs.

To avoid adding a new UPROBE type, instead allow sleepable KPROBE
programs to load and defer the is-it-actually-a-uprobe-program check
to attachment time, where there's already validation of the
corresponding perf_event.

A corollary of this patch is that you can now load a sleepable kprobe
program but cannot attach it.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/fcd44a7cd204f372f6bb03ef794e829adeaef299.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 19:27:29 -07:00
Delyan Kratunov
8c7dcb84e3 bpf: implement sleepable uprobes by chaining gps
uprobes work by raising a trap, setting a task flag from within the
interrupt handler, and processing the actual work for the uprobe on the
way back to userspace. As a result, uprobe handlers already execute in a
might_fault/_sleep context. The primary obstacle to sleepable bpf uprobe
programs is therefore on the bpf side.

Namely, the bpf_prog_array attached to the uprobe is protected by normal
rcu. In order for uprobe bpf programs to become sleepable, it has to be
protected by the tasks_trace rcu flavor instead (and kfree() called after
a corresponding grace period).

Therefore, the free path for bpf_prog_array now chains a tasks_trace and
normal grace periods one after the other.

Users who iterate under tasks_trace read section would
be safe, as would users who iterate under normal read sections (from
non-sleepable locations).

The downside is that the tasks_trace latency affects all perf_event-attached
bpf programs (and not just uprobe ones). This is deemed safe given the
possible attach rates for kprobe/uprobe/tp programs.

Separately, non-sleepable programs need access to dynamically sized
rcu-protected maps, so bpf_run_prog_array_sleepables now conditionally takes
an rcu read section, in addition to the overarching tasks_trace section.

Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/ce844d62a2fd0443b08c5ab02e95bc7149f9aeb1.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 19:27:29 -07:00
Sami Tolvanen
57cd6d157e cfi: Fix __cfi_slowpath_diag RCU usage with cpuidle
RCU_NONIDLE usage during __cfi_slowpath_diag can result in an invalid
RCU state in the cpuidle code path:

  WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter+0xe4/0x138
  ...
  Call trace:
    rcu_eqs_enter+0xe4/0x138
    rcu_idle_enter+0xa8/0x100
    cpuidle_enter_state+0x154/0x3a8
    cpuidle_enter+0x3c/0x58
    do_idle.llvm.6590768638138871020+0x1f4/0x2ec
    cpu_startup_entry+0x28/0x2c
    secondary_start_kernel+0x1b8/0x220
    __secondary_switched+0x94/0x98

Instead, call rcu_irq_enter/exit to wake up RCU only when needed and
disable interrupts for the entire CFI shadow/module check when we do.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20220531175910.890307-1-samitolvanen@google.com
Fixes: cf68fffb66 ("add support for Clang CFI")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2022-06-13 09:18:46 -07:00
Linus Torvalds
b0cb8db396 Merge tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "Tetsuo's patch to trigger build warnings if system-wide wq's are
  flushed along with a TP type update and trivial comment update"

* tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Switch to new kerneldoc syntax for named variable macro argument
  workqueue: Fix type of cpu in trace event
  workqueue: Wrap flush_workqueue() using a macro
2022-06-12 11:16:00 -07:00
Feng Zhou
54a9c3a42d bpf: avoid grabbing spin_locks of all cpus when no free elems
This patch use head->first in pcpu_freelist_head to check freelist
having free or not. If having, grab spin_lock, or check next cpu's
freelist.

Before patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 1043397 events per sec
...
The average of the test results is around 1050000 events per sec.

hash_map the worst: no free
./run_bench_bpf_hashmap_full_update.sh
Setting up benchmark 'bpf-hashmap-ful-update'...
Benchmark 'bpf-hashmap-ful-update' started.
1:hash_map_full_perf 15687 events per sec
...
The average of the test results is around 16000 events per sec.

ftrace trace:
0)               |  htab_map_update_elem() {
0)               |      __pcpu_freelist_pop() {
0)               |        _raw_spin_lock()
0)               |        _raw_spin_unlock()
0)               |        ...
0) + 25.188 us   |      }
0) + 28.439 us   |  }

The test machine is 16C, trying to get spin_lock 17 times, in addition
to 16c, there is an extralist.

after patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 1053298 events per sec
...
The average of the test results is around 1050000 events per sec.

hash_map worst: no free
./run_bench_bpf_hashmap_full_update.sh
Setting up benchmark 'bpf-hashmap-ful-update'...
Benchmark 'bpf-hashmap-ful-update' started.
1:hash_map_full_perf 555830 events per sec
...
The average of the test results is around 550000 events per sec.

ftrace trace:
0)               |  htab_map_update_elem() {
0)               |    alloc_htab_elem() {
0)   0.586 us    |      __pcpu_freelist_pop();
0)   0.945 us    |    }
0)   8.669 us    |  }

It can be seen that after adding this patch, the map performance is
almost not degraded, and when free=0, first check head->first instead of
directly acquiring spin_lock.

Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20220610023308.93798-2-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-11 14:25:35 -07:00
Linus Torvalds
1bc27dec7e Merge tag 'pm-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix an intel_idle issue introduced during the 5.16 development
  cycle and two recent regressions in the system reboot/poweroff code.

  Specifics:

   - Fix CPUIDLE_FLAG_IRQ_ENABLE handling in intel_idle (Peter Zijlstra)

   - Allow all platforms to use the global poweroff handler and make
     non-syscall poweroff code paths work again (Dmitry Osipenko)"

* tag 'pm-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
  kernel/reboot: Fix powering off using a non-syscall code paths
  kernel/reboot: Use static handler for register_platform_power_off()
2022-06-10 11:49:27 -07:00
Rafael J. Wysocki
67e59f8d01 Merge branch 'pm-sysoff'
Merge fixes for regressions introduced by the recent rework of the
system reboot/poweroff code.

* pm-sysoff:
  kernel/reboot: Fix powering off using a non-syscall code paths
  kernel/reboot: Use static handler for register_platform_power_off()
2022-06-10 20:24:10 +02:00
Linus Torvalds
f2ecc964b9 Merge tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:

 - a small cleanup removing "export" of an __init function

 - a small series adding a new infrastructure for platform flags

 - a series adding generic virtio support for Xen guests (frontend side)

* tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: unexport __init-annotated xen_xlate_map_ballooned_pages()
  arm/xen: Assign xen-grant DMA ops for xen-grant DMA devices
  xen/grant-dma-ops: Retrieve the ID of backend's domain for DT devices
  xen/grant-dma-iommu: Introduce stub IOMMU driver
  dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops
  xen/virtio: Enable restricted memory access using Xen grant mappings
  xen/grant-dma-ops: Add option to restrict memory access under Xen
  xen/grants: support allocating consecutive grants
  arm/xen: Introduce xen_setup_dma_ops()
  virtio: replace arch_has_restricted_virtio_memory_access()
  kernel: add platform_has() infrastructure
2022-06-10 09:57:11 -07:00
Linus Torvalds
825464e79d Merge tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - eth: amt: fix possible null-ptr-deref in amt_rcv()

  Previous releases - regressions:

   - tcp: use alloc_large_system_hash() to allocate table_perturb

   - af_unix: fix a data-race in unix_dgram_peer_wake_me()

   - nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling

   - eth: ixgbe: fix unexpected VLAN rx in promisc mode on VF

  Previous releases - always broken:

   - ipv6: fix signed integer overflow in __ip6_append_data

   - netfilter:
       - nat: really support inet nat without l3 address
       - nf_tables: memleak flow rule from commit path

   - bpf: fix calling global functions from BPF_PROG_TYPE_EXT programs

   - openvswitch: fix misuse of the cached connection on tuple changes

   - nfc: nfcmrvl: fix memory leak in nfcmrvl_play_deferred

   - eth: altera: fix refcount leak in altera_tse_mdio_create

  Misc:

   - add Quentin Monnet to bpftool maintainers"

* tag 'net-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  net: amd-xgbe: fix clang -Wformat warning
  tcp: use alloc_large_system_hash() to allocate table_perturb
  net: dsa: realtek: rtl8365mb: fix GMII caps for ports with internal PHY
  net: dsa: mv88e6xxx: correctly report serdes link failure
  net: dsa: mv88e6xxx: fix BMSR error to be consistent with others
  net: dsa: mv88e6xxx: use BMSR_ANEGCOMPLETE bit for filling an_complete
  net: altera: Fix refcount leak in altera_tse_mdio_create
  net: openvswitch: fix misuse of the cached connection on tuple changes
  net: ethernet: mtk_eth_soc: fix misuse of mem alloc interface netdev[napi]_alloc_frag
  ip_gre: test csum_start instead of transport header
  au1000_eth: stop using virt_to_bus()
  ipv6: Fix signed integer overflow in l2tp_ip6_sendmsg
  ipv6: Fix signed integer overflow in __ip6_append_data
  nfc: nfcmrvl: Fix memory leak in nfcmrvl_play_deferred
  nfc: st21nfca: fix incorrect sizing calculations in EVT_TRANSACTION
  nfc: st21nfca: fix memory leaks in EVT_TRANSACTION handling
  nfc: st21nfca: fix incorrect validating logic in EVT_TRANSACTION
  net: ipv6: unexport __init-annotated seg6_hmac_init()
  net: xfrm: unexport __init-annotated xfrm4_protocol_init()
  net: mdio: unexport __init-annotated mdio_bus_init()
  ...
2022-06-09 12:06:52 -07:00
Linus Torvalds
34f4335c16 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:

 - syzkaller NULL pointer dereference

 - TDP MMU performance issue with disabling dirty logging

 - 5.14 regression with SVM TSC scaling

 - indefinite stall on applying live patches

 - unstable selftest

 - memory leak from wrong copy-and-paste

 - missed PV TLB flush when racing with emulation

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  KVM: x86: do not set st->preempted when going back to user space
  KVM: SVM: fix tsc scaling cache logic
  KVM: selftests: Make hyperv_clock selftest more stable
  KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
  x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
  KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
  entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
  KVM: Don't null dereference ops->destroy
2022-06-08 09:16:31 -07:00
Dmitry Osipenko
2b8c612c61 kernel/reboot: Fix powering off using a non-syscall code paths
There are other methods of powering off machine than the reboot syscall.
Previously we missed to cover those methods and it created power-off
regression for some machines, like the PowerPC e500.

Fix this problem by moving the legacy sys-off handler registration to
the latest phase of power-off process and making the kernel_can_power_off()
check the legacy pm_power_off presence.

Tested-by: Michael Ellerman <mpe@ellerman.id.au> # ppce500
Reported-by: Michael Ellerman <mpe@ellerman.id.au> # ppce500
Fixes: da007f171f ("kernel/reboot: Change registration order of legacy power-off handler")
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-06-07 19:42:31 +02:00
Toke Høiland-Jørgensen
f858c2b2ca bpf: Fix calling global functions from BPF_PROG_TYPE_EXT programs
The verifier allows programs to call global functions as long as their
argument types match, using BTF to check the function arguments. One of the
allowed argument types to such global functions is PTR_TO_CTX; however the
check for this fails on BPF_PROG_TYPE_EXT functions because the verifier
uses the wrong type to fetch the vmlinux BTF ID for the program context
type. This failure is seen when an XDP program is loaded using
libxdp (which loads it as BPF_PROG_TYPE_EXT and attaches it to a global XDP
type program).

Fix the issue by passing in the target program type instead of the
BPF_PROG_TYPE_EXT type to bpf_prog_get_ctx() when checking function
argument compatibility.

The first Fixes tag refers to the latest commit that touched the code in
question, while the second one points to the code that first introduced
the global function call verification.

v2:
- Use resolve_prog_type()

Fixes: 3363bd0cfb ("bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support")
Fixes: 51c39bb1d5 ("bpf: Introduce function-by-function verification")
Reported-by: Simon Sundberg <simon.sundberg@kau.se>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20220606075253.28422-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-07 10:41:20 -07:00
Dan Carpenter
fd58f7df24 bpf: Use safer kvmalloc_array() where possible
The kvmalloc_array() function is safer because it has a check for
integer overflows.  These sizes come from the user and I was not
able to see any bounds checking so an integer overflow seems like a
realistic concern.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/Yo9VRVMeHbALyjUH@kili
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-07 10:40:53 -07:00
Yonghong Song
6089fb325c bpf: Add btf enum64 support
Currently, BTF only supports upto 32bit enum value with BTF_KIND_ENUM.
But in kernel, some enum indeed has 64bit values, e.g.,
in uapi bpf.h, we have
  enum {
        BPF_F_INDEX_MASK                = 0xffffffffULL,
        BPF_F_CURRENT_CPU               = BPF_F_INDEX_MASK,
        BPF_F_CTXLEN_MASK               = (0xfffffULL << 32),
  };
In this case, BTF_KIND_ENUM will encode the value of BPF_F_CTXLEN_MASK
as 0, which certainly is incorrect.

This patch added a new btf kind, BTF_KIND_ENUM64, which permits
64bit value to cover the above use case. The BTF_KIND_ENUM64 has
the following three fields followed by the common type:
  struct bpf_enum64 {
    __u32 nume_off;
    __u32 val_lo32;
    __u32 val_hi32;
  };
Currently, btf type section has an alignment of 4 as all element types
are u32. Representing the value with __u64 will introduce a pad
for bpf_enum64 and may also introduce misalignment for the 64bit value.
Hence, two members of val_hi32 and val_lo32 are chosen to avoid these issues.

The kflag is also introduced for BTF_KIND_ENUM and BTF_KIND_ENUM64
to indicate whether the value is signed or unsigned. The kflag intends
to provide consistent output of BTF C fortmat with the original
source code. For example, the original BTF_KIND_ENUM bit value is 0xffffffff.
The format C has two choices, printing out 0xffffffff or -1 and current libbpf
prints out as unsigned value. But if the signedness is preserved in btf,
the value can be printed the same as the original source code.
The kflag value 0 means unsigned values, which is consistent to the default
by libbpf and should also cover most cases as well.

The new BTF_KIND_ENUM64 is intended to support the enum value represented as
64bit value. But it can represent all BTF_KIND_ENUM values as well.
The compiler ([1]) and pahole will generate BTF_KIND_ENUM64 only if the value has
to be represented with 64 bits.

In addition, a static inline function btf_kind_core_compat() is introduced which
will be used later when libbpf relo_core.c changed. Here the kernel shares the
same relo_core.c with libbpf.

  [1] https://reviews.llvm.org/D124641

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220607062600.3716578-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-07 10:20:42 -07:00
Tetsuo Handa
c4f135d643 workqueue: Wrap flush_workqueue() using a macro
Since flush operation synchronously waits for completion, flushing
system-wide WQs (e.g. system_wq) might introduce possibility of deadlock
due to unexpected locking dependency. Tejun Heo commented at [1] that it
makes no sense at all to call flush_workqueue() on the shared WQs as the
caller has no idea what it's gonna end up waiting for.

Although there is flush_scheduled_work() which flushes system_wq WQ with
"Think twice before calling this function! It's very easy to get into
trouble if you don't take great care." warning message, syzbot found a
circular locking dependency caused by flushing system_wq WQ [2].

Therefore, let's change the direction to that developers had better use
their local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq) is
inevitable.

Steps for converting system-wide WQs into local WQs are explained at [3],
and a conversion to stop flushing system-wide WQs is in progress. Now we
want some mechanism for preventing developers who are not aware of this
conversion from again start flushing system-wide WQs.

Since I found that WARN_ON() is complete but awkward approach for teaching
developers about this problem, let's use __compiletime_warning() for
incomplete but handy approach. For completeness, we will also insert
WARN_ON() into __flush_workqueue() after all in-tree users stopped calling
flush_scheduled_work().

Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-07 07:07:14 -10:00
Seth Forshee
3e684903a8 entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
A livepatch transition may stall indefinitely when a kvm vCPU is heavily
loaded. To the host, the vCPU task is a user thread which is spending a
very long time in the ioctl(KVM_RUN) syscall. During livepatch
transition, set_notify_signal() will be called on such tasks to
interrupt the syscall so that the task can be transitioned. This
interrupts guest execution, but when xfer_to_guest_mode_work() sees that
TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
exit to user mode is unnecessary, and guest execution is resumed without
transitioning the task for the livepatch.

This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
is expected to break tasks out of interruptible kernel loops and cause
them to return to userspace. Change xfer_to_guest_mode_work() to handle
TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
loop that an exit to userpsace is needed. Any pending task_work will be
run when get_signal() is called from exit_to_user_mode_loop(), so there
is no longer any need to run task work from xfer_to_guest_mode_work().

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:19:00 -04:00
Linus Torvalds
e71e60cd74 Merge tag 'dma-mapping-5.19-2022-06-06' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:

 - fix a regressin in setting swiotlb ->force_bounce (me)

 - make dma-debug less chatty (Rob Clark)

* tag 'dma-mapping-5.19-2022-06-06' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix setting ->force_bounce
  dma-debug: make things less spammy under memory pressure
2022-06-06 17:56:18 -07:00
Juergen Gross
2130a790ca kernel: add platform_has() infrastructure
Add a simple infrastructure for setting, resetting and querying
platform feature flags.

Flags can be either global or architecture specific.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-06-06 08:06:00 +02:00
Linus Torvalds
e17fee8976 Merge tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull delay-accounting update from Andrew Morton:
 "A single featurette for delay accounting.

  Delayed a bit because, unusually, it had dependencies on both the
  mm-stable and mm-nonmm-stable queues"

* tag 'mm-nonmm-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  delayacct: track delays from write-protect copy
2022-06-05 16:58:27 -07:00
Linus Torvalds
bc1e02c3e5 Merge tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Fix the fallout of sysctl code move which placed the init function
  wrong"

* tag 'sched-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/autogroup: Fix sysctl move
2022-06-05 10:42:40 -07:00
Linus Torvalds
fa11c28046 Merge tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:

  - Make the ICL event constraints match reality

  - Remove a unused local variable

* tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Remove unused local variable
  perf/x86/intel: Fix event constraints for ICL
2022-06-05 10:40:31 -07:00
Linus Torvalds
cbd76edeab Merge tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount handling updates from Al Viro:
 "Cleanups (and one fix) around struct mount handling.

  The fix is usermode_driver.c one - once you've done kern_mount(), you
  must kern_unmount(); simple mntput() will end up with a leak. Several
  failure exits in there messed up that way... In practice you won't hit
  those particular failure exits without fault injection, though"

* tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  move mount-related externs from fs.h to mount.h
  blob_to_mnt(): kern_unmount() is needed to undo kern_mount()
  m->mnt_root->d_inode->i_sb is a weird way to spell m->mnt_sb...
  linux/mount.h: trim includes
  uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)
2022-06-04 19:00:05 -07:00
Linus Torvalds
67850b7bdc Merge tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace_stop cleanups from Eric Biederman:
 "While looking at the ptrace problems with PREEMPT_RT and the problems
  Peter Zijlstra was encountering with ptrace in his freezer rewrite I
  identified some cleanups to ptrace_stop that make sense on their own
  and move make resolving the other problems much simpler.

  The biggest issue is the habit of the ptrace code to change
  task->__state from the tracer to suppress TASK_WAKEKILL from waking up
  the tracee. No other code in the kernel does that and it is straight
  forward to update signal_wake_up and friends to make that unnecessary.

  Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
  then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
  on the fact that all stopped states except the special stop states can
  tolerate spurious wake up and recover their state.

  The state of stopped and traced tasked is changed to be stored in
  task->jobctl as well as in task->__state. This makes it possible for
  the freezer to recover tasks in these special states, as well as
  serving as a general cleanup. With a little more work in that
  direction I believe TASK_STOPPED can learn to tolerate spurious wake
  ups and become an ordinary stop state.

  The TASK_TRACED state has to remain a special state as the registers
  for a process are only reliably available when the process is stopped
  in the scheduler. Fundamentally ptrace needs acess to the saved
  register values of a task.

  There are bunch of semi-random ptrace related cleanups that were found
  while looking at these issues.

  One cleanup that deserves to be called out is from commit 57b6de08b5
  ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
  makes a change that is technically user space visible, in the handling
  of what happens to a tracee when a tracer dies unexpectedly. According
  to our testing and our understanding of userspace nothing cares that
  spurious SIGTRAPs can be generated in that case"

* tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  ptrace: Always take siglock in ptrace_resume
  ptrace: Don't change __state
  ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
  ptrace: Document that wait_task_inactive can't fail
  ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
  signal: Use lockdep_assert_held instead of assert_spin_locked
  ptrace: Remove arch_ptrace_attach
  ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
  ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
  signal: Replace __group_send_sig_info with send_signal_locked
  signal: Rename send_signal send_signal_locked
2022-06-03 16:13:25 -07:00
Linus Torvalds
1ec6574a3c Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kthread updates from Eric Biederman:
 "This updates init and user mode helper tasks to be ordinary user mode
  tasks.

  Commit 40966e316f ("kthread: Ensure struct kthread is present for
  all kthreads") caused init and the user mode helper threads that call
  kernel_execve to have struct kthread allocated for them. This struct
  kthread going away during execve in turned made a use after free of
  struct kthread possible.

  Here, commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
  init and umh") is enough to fix the use after free and is simple
  enough to be backportable.

  The rest of the changes pass struct kernel_clone_args to clean things
  up and cause the code to make sense.

  In making init and the user mode helpers tasks purely user mode tasks
  I ran into two complications. The function task_tick_numa was
  detecting tasks without an mm by testing for the presence of
  PF_KTHREAD. The initramfs code in populate_initrd_image was using
  flush_delayed_fput to ensuere the closing of all it's file descriptors
  was complete, and flush_delayed_fput does not work in a userspace
  thread.

  I have looked and looked and more complications and in my code review
  I have not found any, and neither has anyone else with the code
  sitting in linux-next"

* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sched: Update task_tick_numa to ignore tasks without an mm
  fork: Stop allowing kthreads to call execve
  fork: Explicitly set PF_KTHREAD
  init: Deal with the init process being a user mode process
  fork: Generalize PF_IO_WORKER handling
  fork: Explicity test for idle tasks in copy_thread
  fork: Pass struct kernel_clone_args into copy_thread
  kthread: Don't allocate kthread_struct for init and umh
2022-06-03 16:03:05 -07:00
Linus Torvalds
21873bd66b Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
 "Most of issues addressed were introduced during this merging window.

   - Initialise jump labels before setup_machine_fdt(), needed by commit
     f5bda35fba ("random: use static branch for crng_ready()").

   - Sparse warnings: missing prototype, incorrect __user annotation.

   - Skip SVE kselftest if not sufficient vector lengths supported"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  kselftest/arm64: signal: Skip SVE signal test if not enough VLs supported
  arm64: Initialize jump labels before setup_machine_fdt()
  arm64: hibernate: Fix syntax errors in comments
  arm64: Remove the __user annotation for the restore_za_context() argument
  ftrace/fgraph: fix increased missing-prototypes warnings
2022-06-03 14:05:34 -07:00
Tadeusz Struk
4c46091ee9 bpf: Fix KASAN use-after-free Read in compute_effective_progs
Syzbot found a Use After Free bug in compute_effective_progs().
The reproducer creates a number of BPF links, and causes a fault
injected alloc to fail, while calling bpf_link_detach on them.
Link detach triggers the link to be freed by bpf_link_free(),
which calls __cgroup_bpf_detach() and update_effective_progs().
If the memory allocation in this function fails, the function restores
the pointer to the bpf_cgroup_link on the cgroup list, but the memory
gets freed just after it returns. After this, every subsequent call to
update_effective_progs() causes this already deallocated pointer to be
dereferenced in prog_list_length(), and triggers KASAN UAF error.

To fix this issue don't preserve the pointer to the prog or link in the
list, but remove it and replace it with a dummy prog without shrinking
the table. The subsequent call to __cgroup_bpf_detach() or
__cgroup_bpf_detach() will correct it.

Fixes: af6eea5743 ("bpf: Implement bpf_link-based cgroup BPF program attachment")
Reported-by: <syzbot+f264bffdfbd5614f3bb2@syzkaller.appspotmail.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://syzkaller.appspot.com/bug?id=8ebf179a95c2a2670f7cf1ba62429ec044369db4
Link: https://lore.kernel.org/bpf/20220517180420.87954-1-tadeusz.struk@linaro.org
2022-06-02 16:26:47 -07:00
Pu Lehui
cc1685546d bpf: Correct the comment about insn_to_jit_off
The insn_to_jit_off passed to bpf_prog_fill_jited_linfo should be the
first byte of the next instruction, or the byte off to the end of the
current instruction.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220530092815.1112406-4-pulehui@huawei.com
2022-06-02 16:26:06 -07:00
Pu Lehui
2cd0085227 bpf: Unify data extension operation of jited_ksyms and jited_linfo
We found that 32-bit environment can not print BPF line info due to a data
inconsistency between jited_ksyms[0] and jited_linfo[0].

For example:

  jited_kyms[0] = 0xb800067c, jited_linfo[0] = 0xffffffffb800067c

We know that both of them store BPF func address, but due to the different
data extension operations when extended to u64, they may not be the same.
We need to unify the data extension operations of them.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzZ-eDcdJZgJ+Np7Y=V-TVjDDvOMqPwzKjyWrh=i5juv4w@mail.gmail.com
Link: https://lore.kernel.org/bpf/20220530092815.1112406-2-pulehui@huawei.com
2022-06-02 16:25:57 -07:00
Linus Torvalds
58f9d52ff6 Merge tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Current release - new code bugs:

   - af_packet: make sure to pull the MAC header, avoid skb panic in GSO

   - ptp_clockmatrix: fix inverted logic in is_single_shot()

   - netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag

   - dt-bindings: net: adin: fix adi,phy-output-clock description syntax

   - wifi: iwlwifi: pcie: rename CAUSE macro, avoid MIPS build warning

  Previous releases - regressions:

   - Revert "net: af_key: add check for pfkey_broadcast in function
     pfkey_process"

   - tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd

   - nf_tables: disallow non-stateful expression in sets earlier

   - nft_limit: clone packet limits' cost value

   - nf_tables: double hook unregistration in netns path

   - ping6: fix ping -6 with interface name

  Previous releases - always broken:

   - sched: fix memory barriers to prevent skbs from getting stuck in
     lockless qdiscs

   - neigh: set lower cap for neigh_managed_work rearming, avoid
     constantly scheduling the probe work

   - bpf: fix probe read error on big endian in ___bpf_prog_run()

   - amt: memory leak and error handling fixes

  Misc:

   - ipv6: expand & rename accept_unsolicited_na to accept_untracked_na"

* tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (80 commits)
  net/af_packet: make sure to pull mac header
  net: add debug info to __skb_pull()
  net: CONFIG_DEBUG_NET depends on CONFIG_NET
  stmmac: intel: Add RPL-P PCI ID
  net: stmmac: use dev_err_probe() for reporting mdio bus registration failure
  tipc: check attribute length for bearer name
  ice: fix access-beyond-end in the switch code
  nfp: remove padding in nfp_nfdk_tx_desc
  ax25: Fix ax25 session cleanup problems
  net: usb: qmi_wwan: Add support for Cinterion MV31 with new baseline
  sfc/siena: fix wrong tx channel offset with efx_separate_tx_channels
  sfc/siena: fix considering that all channels have TX queues
  socket: Don't use u8 type in uapi socket.h
  net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()
  net: ping6: Fix ping -6 with interface name
  macsec: fix UAF bug for real_dev
  octeontx2-af: fix error code in is_valid_offset()
  wifi: mac80211: fix use-after-free in chanctx code
  bonding: guard ns_targets by CONFIG_IPV6
  tcp: tcp_rtx_synack() can be called from process context
  ...
2022-06-02 12:50:16 -07:00
Saravana Kannan
73503963b7 module: Fix prefix for module.sig_enforce module param
Commit cfc1d27789 ("module: Move all into module/") changed the prefix
of the module param by moving/renaming files.  A later commit also moves
the module_param() into a different file, thereby changing the prefix
yet again.

This would break kernel cmdline compatibility and also userspace
compatibility at /sys/module/module/parameters/sig_enforce.

So, set the prefix back to "module.".

Fixes: cfc1d27789 ("module: Move all into module/")
Link: https://lore.kernel.org/lkml/20220602034111.4163292-1-saravanak@google.com/
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-02 12:44:33 -07:00
Dmitry Osipenko
587b9bfe06 kernel/reboot: Use static handler for register_platform_power_off()
The register_platform_power_off() fails on m68k platform due to the
memory allocation error that happens at a very early boot time when
memory allocator isn't available yet. Fix it by using a static sys-off
handler for the platform-level power-off handlers.

Fixes: f0f7e5265b ("m68k: Switch to new sys-off handler API")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-06-02 20:36:13 +02:00
Linus Torvalds
7c9e960c63 Merge tag 'livepatching-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching cleanup from Petr Mladek:

 - Remove duplicated livepatch code [Christophe]

* tag 'livepatching-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Remove klp_arch_set_pc() and asm/livepatch.h
2022-06-02 08:55:01 -07:00
Linus Torvalds
12831f6486 Merge tag 'printk-for-5.19-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fixup from Petr Mladek:

 - Revert inappropriate use of wake_up_interruptible_all() in printk()

* tag 'printk-for-5.19-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  Revert "printk: wake up all waiters"
2022-06-02 08:49:54 -07:00