Commit Graph

131220 Commits

Author SHA1 Message Date
Martin KaFai Lau
3cee6fb8e6 bpf: tcp: Support bpf_(get|set)sockopt in bpf tcp iter
This patch allows bpf tcp iter to call bpf_(get|set)sockopt.
To allow a specific bpf iter (tcp here) to call a set of helpers,
get_func_proto function pointer is added to bpf_iter_reg.
The bpf iter is a tracing prog which currently requires
CAP_PERFMON or CAP_SYS_ADMIN, so this patch does not
impose other capability checks for bpf_(get|set)sockopt.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200619.1036715-1-kafai@fb.com
2021-07-23 16:45:07 -07:00
Martin KaFai Lau
05c0b35709 tcp: seq_file: Replace listening_hash with lhash2
This patch moves the tcp seq_file iteration on listeners
from the port only listening_hash to the port+addr lhash2.

When iterating from the bpf iter, the next patch will need to
lock the socket such that the bpf iter can call setsockopt (e.g. to
change the TCP_CONGESTION).  To avoid locking the bucket and then locking
the sock, the bpf iter will first batch some sockets from the same bucket
and then unlock the bucket.  If the bucket size is small (which
usually is), it is easier to batch the whole bucket such that it is less
likely to miss a setsockopt on a socket due to changes in the bucket.

However, the port only listening_hash could have many listeners
hashed to a bucket (e.g. many individual VIP(s):443 and also
multiple by the number of SO_REUSEPORT).  We have seen bucket size in
tens of thousands range.  Also, the chance of having changes
in some popular port buckets (e.g. 443) is also high.

The port+addr lhash2 was introduced to solve this large listener bucket
issue.  Also, the listening_hash usage has already been replaced with
lhash2 in the fast path inet[6]_lookup_listener().  This patch follows
the same direction on moving to lhash2 and iterates the lhash2
instead of listening_hash.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200606.1035783-1-kafai@fb.com
2021-07-23 16:44:57 -07:00
Martin KaFai Lau
62001372c2 bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state
A following patch will create a separate struct to store extra
bpf_iter state and it will embed the existing tcp_iter_state like this:
struct bpf_tcp_iter_state {
	struct tcp_iter_state state;
	/* More bpf_iter specific states here ... */
}

As a prep work, this patch removes the
"struct tcp_seq_afinfo *bpf_seq_afinfo" where its purpose is
to tell if it is iterating from bpf_iter instead of proc fs.
Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from
bpf_iter.  The kernel should not filter by the addr family and
leave this filtering decision to the bpf prog.

Instead of adding a "*bpf_seq_afinfo" pointer, this patch uses the
"seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating
from the bpf iter.

The bpf_iter_(init|fini)_tcp() is left here to prepare for
the change of a following patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200554.1034982-1-kafai@fb.com
2021-07-23 16:44:41 -07:00
Andrii Nakryiko
c7603cfa04 bpf: Add ambient BPF runtime context stored in current
b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed the problem with cgroup-local storage use in BPF by
pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
possible BPF program preemptions and nested executions.

While this seems to work good in practice, it introduces new and unnecessary
failure mode in which not all BPF programs might be executed if we fail to
find an unused slot for cgroup storage, however unlikely it is. It might also
not be so unlikely when/if we allow sleepable cgroup BPF programs in the
future.

Further, the way that cgroup storage is implemented as ambiently-available
property during entire BPF program execution is a convenient way to pass extra
information to BPF program and helpers without requiring user code to pass
around extra arguments explicitly. So it would be good to have a generic
solution that can allow implementing this without arbitrary restrictions.
Ideally, such solution would work for both preemptable and sleepable BPF
programs in exactly the same way.

This patch introduces such solution, bpf_run_ctx. It adds one pointer field
(bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
macros in such a way that it always stays valid throughout BPF program
execution. BPF program preemption is handled by remembering previous
current->bpf_ctx value locally while executing nested BPF program and
restoring old value after nested BPF program finishes. This is handled by two
helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
supposed to be used before and after BPF program runs, respectively.

Restoring old value of the pointer handles preemption, while bpf_run_ctx
pointer being a property of current task_struct naturally solves this problem
for sleepable BPF programs by "following" BPF program execution as it is
scheduled in and out of CPU. It would even allow CPU migration of BPF
programs, even though it's not currently allowed by BPF infra.

This patch cleans up cgroup local storage handling as a first application. The
design itself is generic, though, with bpf_run_ctx being an empty struct that
is supposed to be embedded into a specific struct for a given BPF program type
(bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
this mechanism for other uses within tracing BPF programs.

To verify that this change doesn't revert the fix to the original cgroup
storage issue, I ran the same repro as in the original report ([0]) and didn't
get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).

  [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/

Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
2021-07-16 21:15:28 +02:00
Mark Gray
b83d23a2a3 openvswitch: Introduce per-cpu upcall dispatch
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:

* On systems with a large number of vports, there is a correspondingly
large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.

In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:

a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.

The corresponding user space code can be found at:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html

Bugzilla: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 11:06:33 -07:00
David S. Miller
82a1ffe57e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-07-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 45 non-merge commits during the last 15 day(s) which contain
a total of 52 files changed, 3122 insertions(+), 384 deletions(-).

The main changes are:

1) Introduce bpf timers, from Alexei.

2) Add sockmap support for unix datagram socket, from Cong.

3) Fix potential memleak and UAF in the verifier, from He.

4) Add bpf_get_func_ip helper, from Jiri.

5) Improvements to generic XDP mode, from Kumar.

6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 22:40:10 -07:00
Cong Wang
9825d866ce af_unix: Implement unix_dgram_bpf_recvmsg()
We have to implement unix_dgram_bpf_recvmsg() to replace the
original ->recvmsg() to retrieve skmsg from ingress_msg.

AF_UNIX is again special here because the lack of
sk_prot->recvmsg(). I simply add a special case inside
unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-8-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
c63829182c af_unix: Implement ->psock_update_sk_prot()
Now we can implement unix_bpf_update_proto() to update
sk_prot, especially prot->close().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-7-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
17edea21b3 sock_map: Relax config dependency to CONFIG_NET
Currently sock_map still has Kconfig dependency on CONFIG_INET,
but there is no actual functional dependency on it after we
introduce ->psock_update_sk_prot().

We have to extend it to CONFIG_NET now as we are going to
support AF_UNIX.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
2021-07-15 18:17:49 -07:00
Jiri Olsa
9ffd9f3ff7 bpf: Add bpf_get_func_ip helper for kprobe programs
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_KPROBE programs,
so it's now possible to call bpf_get_func_ip from both kprobe and
kretprobe programs.

Taking the caller's address from 'struct kprobe::addr', which is
defined for both kprobe and kretprobe.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-5-jolsa@kernel.org
2021-07-15 17:59:09 -07:00
Jiri Olsa
9b99edcae5 bpf: Add bpf_get_func_ip helper for tracing programs
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_TRACING programs,
specifically for all trampoline attach types.

The trampoline's caller IP address is stored in (ctx - 8) address.
so there's no reason to actually call the helper, but rather fixup
the call instruction and return [ctx - 8] value directly.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-4-jolsa@kernel.org
2021-07-15 17:58:41 -07:00
Jiri Olsa
1e37392ccc bpf: Enable BPF_TRAMP_F_IP_ARG for trampolines with call_get_func_ip
Enabling BPF_TRAMP_F_IP_ARG for trampolines that actually need it.

The BPF_TRAMP_F_IP_ARG adds extra 3 instructions to trampoline code
and is used only by programs with bpf_get_func_ip helper, which is
added in following patch and sets call_get_func_ip bit.

This patch ensures that BPF_TRAMP_F_IP_ARG flag is used only for
trampolines that have programs with call_get_func_ip set.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-3-jolsa@kernel.org
2021-07-15 17:16:06 -07:00
Jiri Olsa
7e6f3cd89f bpf, x86: Store caller's ip in trampoline stack
Storing caller's ip in trampoline's stack. Trampoline programs
can reach the IP in (ctx - 8) address, so there's no change in
program's arguments interface.

The IP address is takes from [fp + 8], which is return address
from the initial 'call fentry' call to trampoline.

This IP address will be returned via bpf_get_func_ip helper
helper, which is added in following patches.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-2-jolsa@kernel.org
2021-07-15 17:16:06 -07:00
Alexei Starovoitov
7ddc80a476 bpf: Teach stack depth check about async callbacks.
Teach max stack depth checking algorithm about async callbacks
that don't increase bpf program stack size.
Also add sanity check that bpf_tail_call didn't sneak into async cb.
It's impossible, since PTR_TO_CTX is not available in async cb,
hence the program cannot contain bpf_tail_call(ctx,...);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
bfc6bb74e4 bpf: Implement verifier support for validation of async callbacks.
bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
and pass them into the helpers as function callbacks.
In case of bpf_for_each_map_elem() the callback is invoked synchronously
and the verifier treats it as a normal subprogram call by adding another
bpf_func_state and new frame in __check_func_call().
bpf_timer_set_callback() doesn't invoke the callback directly.
The subprogram will be called asynchronously from bpf_timer_cb().
Teach the verifier to validate such async callbacks as special kind
of jump by pushing verifier state into stack and let pop_stack() process it.

Special care needs to be taken during state pruning.
The call insn doing bpf_timer_set_callback has to be a prune_point.
Otherwise short timer callbacks might not have prune points in front of
bpf_timer_set_callback() which means is_state_visited() will be called
after this call insn is processed in __check_func_call(). Which means that
another async_cb state will be pushed to be walked later and the verifier
will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
Since push_async_cb() looks like another push_stack() branch the
infinite loop detection will trigger false positive. To recognize
this case mark such states as in_async_callback_fn.
To distinguish infinite loop in async callback vs the same callback called
with different arguments for different map and timer add async_entry_cnt
to bpf_func_state.

Enforce return zero from async callbacks.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
3e8ce29850 bpf: Prevent pointer mismatch in bpf_timer_init.
bpf_timer_init() arguments are:
1. pointer to a timer (which is embedded in map element).
2. pointer to a map.
Make sure that pointer to a timer actually belongs to that map.

Use map_uid (which is unique id of inner map) to reject:
inner_map1 = bpf_map_lookup_elem(outer_map, key1)
inner_map2 = bpf_map_lookup_elem(outer_map, key2)
if (inner_map1 && inner_map2) {
    timer = bpf_map_lookup_elem(inner_map1);
    if (timer)
        // mismatch would have been allowed
        bpf_timer_init(timer, inner_map2);
}

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
68134668c1 bpf: Add map side support for bpf timers.
Restrict bpf timers to array, hash (both preallocated and kmalloced), and
lru map types. The per-cpu maps with timers don't make sense, since 'struct
bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
the number of timers depends on number of possible cpus and timers would not be
accessible from all cpus. lpm map support can be added in the future.
The timers in inner maps are supported.

The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
bpf_timer in a given map element.

Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
that map element indeed contains 'struct bpf_timer'.

Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
map element data is reused in preallocated htab and lru maps.

Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
map element. There could be one of each, but not more than one. Due to 'one
bpf_timer in one element' restriction do not support timers in global data,
since global data is a map of single element, but from bpf program side it's
seen as many global variables and restriction of single global timer would be
odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
timers, since user space could have corrupted mmap element and crashed the
kernel. The maps with timers cannot be readonly. Due to these restrictions
search for bpf_timer in datasec BTF in case it was placed in the global data to
report clear error.

The previous patch allowed 'struct bpf_timer' as a first field in a map
element only. Relax this restriction.

Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
the timer when lru map deletes an element as a part of it eviction algorithm.

Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
The timer operation are done through helpers only.
This is similar to 'struct bpf_spin_lock'.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
b00628b1c7 bpf: Introduce bpf timers.
Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
in hash/array/lru maps as a regular field and helpers to operate on it:

// Initialize the timer.
// First 4 bits of 'flags' specify clockid.
// Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);

// Configure the timer to call 'callback_fn' static function.
long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);

// Arm the timer to expire 'nsec' nanoseconds from the current time.
long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);

// Cancel the timer and wait for callback_fn to finish if it was running.
long bpf_timer_cancel(struct bpf_timer *timer);

Here is how BPF program might look like:
struct map_elem {
    int counter;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
    struct map_elem *val;
    int key = 0;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
        bpf_timer_set_callback(&val->timer, timer_cb);
        bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */, 0);
    }
}

This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patches add necessary safety checks.

Only programs with CAP_BPF are allowed to use bpf_timer.

The amount of timers used by the program is constrained by
the memcg recorded at map creation time.

The bpf_timer_init() helper needs explicit 'map' argument because inner maps
are dynamic and not known at load time. While the bpf_timer_set_callback() is
receiving hidden 'aux->prog' argument supplied by the verifier.

The prog pointer is needed to do refcnting of bpf program to make sure that
program doesn't get freed while the timer is armed. This approach relies on
"user refcnt" scheme used in prog_array that stores bpf programs for
bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
paired with bpf_timer_cancel() that will drop the prog refcnt. The
ops->map_release_uref is responsible for cancelling the timers and dropping
prog refcnt when user space reference to a map reaches zero.
This uref approach is done to make sure that Ctrl-C of user space process will
not leave timers running forever unless the user space explicitly pinned a map
that contained timers in bpffs.

bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
have user references (is not held by open file descriptor from user space and
not pinned in bpffs).

The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
and free the timer if given map element had it allocated.
"bpftool map update" command can be used to cancel timers.

The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
'__u64 :64' has 1 byte alignment of 8 byte padding.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Richard Laing
5c2c853159 bus: mhi: pci-generic: configurable network interface MRU
The MRU value used by the MHI MBIM network interface affects
the throughput performance of the interface. Different modem
models use different default MRU sizes based on their bandwidth
capabilities. Large values generally result in higher throughput
for larger packet sizes.

In addition if the MRU used by the MHI device is larger than that
specified in the MHI net device the data is fragmented and needs
to be re-assembled which generates a (single) warning message about
the fragmented packets. Setting the MRU on both ends avoids the
extra processing to re-assemble the packets.

This patch allows the documented MRU for a modem to be automatically
set as the MHI net device MRU avoiding fragmentation and improving
throughput performance.

Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 10:14:44 -07:00
Kuniyuki Iwashima
f170acda7f bpf: Fix a typo of reuseport map in bpf.h.
Fix s/BPF_MAP_TYPE_REUSEPORT_ARRAY/BPF_MAP_TYPE_REUSEPORT_SOCKARRAY/ typo
in bpf.h.

Fixes: 2dbb9b9e6d ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210714124317.67526-1-kuniyu@amazon.co.jp
2021-07-14 18:12:05 -07:00
Linus Torvalds
8096acd744 Merge tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski.
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - sock: fix parameter order in sock_setsockopt()

  Current release - new code bugs:

   - netfilter: nft_last:
       - fix incorrect arithmetic when restoring last used
       - honor NFTA_LAST_SET on restoration

  Previous releases - regressions:

   - udp: properly flush normal packet at GRO time

   - sfc: ensure correct number of XDP queues; don't allow enabling the
     feature if there isn't sufficient resources to Tx from any CPU

   - dsa: sja1105: fix address learning getting disabled on the CPU port

   - mptcp: addresses a rmem accounting issue that could keep packets in
     subflow receive buffers longer than necessary, delaying MPTCP-level
     ACKs

   - ip_tunnel: fix mtu calculation for ETHER tunnel devices

   - do not reuse skbs allocated from skbuff_fclone_cache in the napi
     skb cache, we'd try to return them to the wrong slab cache

   - tcp: consistently disable header prediction for mptcp

  Previous releases - always broken:

   - bpf: fix subprog poke descriptor tracking use-after-free

   - ipv6:
       - allocate enough headroom in ip6_finish_output2() in case
         iptables TEE is used
       - tcp: drop silly ICMPv6 packet too big messages to avoid
         expensive and pointless lookups (which may serve as a DDOS
         vector)
       - make sure fwmark is copied in SYNACK packets
       - fix 'disable_policy' for forwarded packets (align with IPv4)

   - netfilter: conntrack:
       - do not renew entry stuck in tcp SYN_SENT state
       - do not mark RST in the reply direction coming after SYN packet
         for an out-of-sync entry

   - mptcp: cleanly handle error conditions with MP_JOIN and syncookies

   - mptcp: fix double free when rejecting a join due to port mismatch

   - validate lwtstate->data before returning from skb_tunnel_info()

   - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path

   - mt76: mt7921: continue to probe driver when fw already downloaded

   - bonding: fix multiple issues with offloading IPsec to (thru?) bond

   - stmmac: ptp: fix issues around Qbv support and setting time back

   - bcmgenet: always clear wake-up based on energy detection

  Misc:

   - sctp: move 198 addresses from unusable to private scope

   - ptp: support virtual clocks and timestamping

   - openvswitch: optimize operation for key comparison"

* tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
  net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
  sfc: add logs explaining XDP_TX/REDIRECT is not available
  sfc: ensure correct number of XDP queues
  sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
  net: fddi: fix UAF in fza_probe
  net: dsa: sja1105: fix address learning getting disabled on the CPU port
  net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
  net: Use nlmsg_unicast() instead of netlink_unicast()
  octeontx2-pf: Fix uninitialized boolean variable pps
  ipv6: allocate enough headroom in ip6_finish_output2()
  net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
  net: bridge: multicast: fix MRD advertisement router port marking race
  net: bridge: multicast: fix PIM hello router port marking race
  net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
  dsa: fix for_each_child.cocci warnings
  virtio_net: check virtqueue_add_sgs() return value
  mptcp: properly account bulk freed memory
  selftests: mptcp: fix case multiple subflows limited by server
  mptcp: avoid processing packet if a subflow reset
  mptcp: fix syncookie process if mptcp can not_accept new subflow
  ...
2021-07-14 09:24:32 -07:00
Christian Brauner
d1d488d813 fs: add vfs_parse_fs_param_source() helper
Add a simple helper that filesystems can use in their parameter parser
to parse the "source" parameter. A few places open-coded this function
and that already caused a bug in the cgroup v1 parser that we fixed.
Let's make it harder to get this wrong by introducing a helper which
performs all necessary checks.

Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-14 09:19:06 -07:00
Matthew Wilcox (Oracle)
79789db03f mm: Make copy_huge_page() always available
Rewrite copy_huge_page() and move it into mm/util.c so it's always
available.  Fixes an exposure of uninitialised memory on configurations
with HUGETLB and UFFD enabled and MIGRATION disabled.

Fixes: 8cc5fcbb5b ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-12 11:30:56 -07:00
Marek Behún
a5de4be0aa net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
It seems that we cannot differentiate 88X3310 from 88X3340 by simply
looking at bit 3 of revision ID. This only works on revisions A0 and A1.
On revision B0, this bit is always 1.

Instead use the 3.d00d register for differentiation, since this register
contains information about number of ports on the device.

Fixes: 9885d016ff ("net: phy: marvell10g: add separate structure for 88X3340")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reported-by: Matteo Croce <mcroce@linux.microsoft.com>
Tested-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-11 10:02:33 -07:00
Linus Torvalds
81361b837a Merge tag 'kbuild-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:

 - Increase the -falign-functions alignment for the debug option.

 - Remove ugly libelf checks from the top Makefile.

 - Make the silent build (-s) more silent.

 - Re-compile the kernel if KBUILD_BUILD_TIMESTAMP is specified.

 - Various script cleanups

* tag 'kbuild-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (27 commits)
  scripts: add generic syscallnr.sh
  scripts: check duplicated syscall number in syscall table
  sparc: syscalls: use pattern rules to generate syscall headers
  parisc: syscalls: use pattern rules to generate syscall headers
  nds32: add arch/nds32/boot/.gitignore
  kbuild: mkcompile_h: consider timestamp if KBUILD_BUILD_TIMESTAMP is set
  kbuild: modpost: Explicitly warn about unprototyped symbols
  kbuild: remove trailing slashes from $(KBUILD_EXTMOD)
  kconfig.h: explain IS_MODULE(), IS_ENABLED()
  kconfig: constify long_opts
  scripts/setlocalversion: simplify the short version part
  scripts/setlocalversion: factor out 12-chars hash construction
  scripts/setlocalversion: add more comments to -dirty flag detection
  scripts/setlocalversion: remove workaround for old make-kpkg
  scripts/setlocalversion: remove mercurial, svn and git-svn supports
  kbuild: clean up ${quiet} checks in shell scripts
  kbuild: sink stdout from cmd for silent build
  init: use $(call cmd,) for generating include/generated/compile.h
  kbuild: merge scripts/mkmakefile to top Makefile
  sh: move core-y in arch/sh/Makefile to arch/sh/Kbuild
  ...
2021-07-10 11:01:38 -07:00
Linus Torvalds
e98e03d075 Merge tag 's390-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:

 - Fix preempt_count initialization.

 - Rework call_on_stack() macro to add proper type handling and avoid
   possible register corruption.

 - More error prone "register asm" removal and fixes.

 - Fix syscall restarting when multiple signals are coming in. This adds
   minimalistic trampolines to vdso so we can return from signal without
   using the stack which requires pgm check handler hacks when NX is
   enabled.

 - Remove HAVE_IRQ_EXIT_ON_IRQ_STACK since this is no longer true after
   switch to generic entry.

 - Fix protected virtualization secure storage access exception
   handling.

 - Make machine check C handler always enter with DAT enabled and move
   register validation to C code.

 - Fix tinyconfig boot problem by avoiding MONITOR CALL without
   CONFIG_BUG.

 - Increase asm symbols alignment to 16 to make it consistent with
   compilers.

 - Enable concurrent access to the CPU Measurement Counter Facility.

 - Add support for dynamic AP bus size limit and rework ap_dqap to deal
   with messages greater than recv buffer.

* tag 's390-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
  s390: preempt: Fix preempt_count initialization
  s390/linkage: increase asm symbols alignment to 16
  s390: rename CALL_ON_STACK_NORETURN() to call_on_stack_noreturn()
  s390: add type checking to CALL_ON_STACK_NORETURN() macro
  s390: remove old CALL_ON_STACK() macro
  s390/softirq: use call_on_stack() macro
  s390/lib: use call_on_stack() macro
  s390/smp: use call_on_stack() macro
  s390/kexec: use call_on_stack() macro
  s390/irq: use call_on_stack() macro
  s390/mm: use call_on_stack() macro
  s390: introduce proper type handling call_on_stack() macro
  s390/irq: simplify on_async_stack()
  s390/irq: inline do_softirq_own_stack()
  s390/irq: simplify do_softirq_own_stack()
  s390/ap: get rid of register asm in ap_dqap()
  s390: rename PIF_SYSCALL_RESTART to PIF_EXECVE_PGSTE_RESTART
  s390: move restart of execve() syscall
  s390/signal: remove sigreturn on stack
  s390/signal: switch to using vdso for sigreturn and syscall restart
  ...
2021-07-10 10:46:14 -07:00
Linus Torvalds
071e5aceeb Merge tag 'arm-drivers-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM driver updates from Olof Johansson:

 - Reset controllers: Adding support for Microchip Sparx5 Switch.

 - Memory controllers: ARM Primecell PL35x SMC memory controller driver
   cleanups and improvements.

 - i.MX SoC drivers: Power domain support for i.MX8MM and i.MX8MN.

 - Rockchip: RK3568 power domains support + DT binding updates,
   cleanups.

 - Qualcomm SoC drivers: Amend socinfo with more SoC/PMIC details,
   including support for MSM8226, MDM9607, SM6125 and SC8180X.

 - ARM FFA driver: "Firmware Framework for ARMv8-A", defining management
   interfaces and communication (including bus model) between partitions
   both in Normal and Secure Worlds.

 - Tegra Memory controller changes, including major rework to deal with
   identity mappings at boot and integration with ARM SMMU pieces.

* tag 'arm-drivers-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (120 commits)
  firmware: turris-mox-rwtm: add marvell,armada-3700-rwtm-firmware compatible string
  firmware: turris-mox-rwtm: show message about HWRNG registration
  firmware: turris-mox-rwtm: fail probing when firmware does not support hwrng
  firmware: turris-mox-rwtm: report failures better
  firmware: turris-mox-rwtm: fix reply status decoding function
  soc: imx: gpcv2: add support for i.MX8MN power domains
  dt-bindings: add defines for i.MX8MN power domains
  firmware: tegra: bpmp: Fix Tegra234-only builds
  iommu/arm-smmu: Use Tegra implementation on Tegra186
  iommu/arm-smmu: tegra: Implement SID override programming
  iommu/arm-smmu: tegra: Detect number of instances at runtime
  dt-bindings: arm-smmu: Add Tegra186 compatible string
  firmware: qcom_scm: Add MDM9607 compatible
  soc: qcom: rpmpd: Add MDM9607 RPM Power Domains
  soc: renesas: Add support to read LSI DEVID register of RZ/G2{L,LC} SoC's
  soc: renesas: Add ARCH_R9A07G044 for the new RZ/G2L SoC's
  dt-bindings: soc: rockchip: drop unnecessary #phy-cells from grf.yaml
  memory: emif: remove unused frequency and voltage notifiers
  memory: fsl_ifc: fix leak of private memory on probe failure
  memory: fsl_ifc: fix leak of IO mapping on probe failure
  ...
2021-07-10 09:46:20 -07:00
Linus Torvalds
e083bbd604 Merge tag 'arm-dt-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM devicetree updates from Olof Johansson:
 "Like always, the DT branch is sizable. There are numerous additions
  and fixes to existing platforms, but also a handful of new ones
  introduced. Less than some other releases, but there's been
  significant work on cleanups, refactorings and device enabling on
  existing platforms.

  A non-exhaustive list of new material:

   - Refactoring of BCM2711 dtsi structure to add support for the
     Raspberry Pi 400

   - Rockchip: RK3568 SoC and EVB, video codecs for
     rk3036/3066/3188/322x

   - Qualcomm: SA8155p Automotive platform (SM8150 derivative),
     SM8150/8250 enhancements and support for Sony Xperia 1/1II and
     5/5II

   - TI K3: PCI/USB3 support on AM64-sk boards, R5 remoteproc
     definitions

   - TI OMAP: Various cleanups

   - Tegra: Audio support for Jetson Xavier NX, SMMU support on Tegra194

   - Qualcomm: lots of additions for peripherals across several SoCs,
     and new support for Microsoft Surface Duo (SM8150-based), Huawei
     Ascend G7.

   - i.MX: Numerous additions of features across SoCs and boards.

   - Allwinner: More device bindings for V3s, Forlinx OKA40i-C and
     NanoPi R1S H5 boards

   - MediaTek: More device bindings for mt8167, new Chromebook system
     variants for mt8183

   - Renesas: RZ/G2L SoC and EVK added

   - Amlogic: BananaPi BPI-M5 board added"

* tag 'arm-dt-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (511 commits)
  arm64: dts: rockchip: add basic dts for RK3568 EVB
  arm64: dts: rockchip: add core dtsi for RK3568 SoC
  arm64: dts: rockchip: add generic pinconfig settings used by most Rockchip socs
  ARM: dts: rockchip: add vpu and vdec node for RK322x
  ARM: dts: rockchip: add vpu nodes for RK3066 and RK3188
  ARM: dts: rockchip: add vpu node for RK3036
  arm64: dts: ipq8074: Add QUP6 I2C node
  arm64: dts: rockchip: Re-add regulator-always-on for vcc_sdio for rk3399-roc-pc
  arm64: dts: rockchip: Re-add regulator-boot-on, regulator-always-on for vdd_gpu on rk3399-roc-pc
  arm64: dts: rockchip: add ir-receiver for rk3399-roc-pc
  arm64: dts: rockchip: Add USB-C port details for rk3399 Firefly
  arm64: dts: rockchip: Sort rk3399 firefly pinmux entries
  arm64: dts: rockchip: add infrared receiver node to RK3399 Firefly
  arm64: dts: rockchip: add SPDIF node for rk3399-firefly
  arm64: dts: rockchip: Add Rotation Property for OGA Panel
  arm64: dts: qcom: sc7180: bus votes for eMMC and SD card
  arm64: dts: qcom: sm8250-edo: Add Samsung touchscreen
  arm64: dts: qcom: sm8250-edo: Enable GPI DMA
  arm64: dts: qcom: sm8250-edo: Enable ADSP/CDSP/SLPI
  arm64: dts: qcom: sm8250-edo: Enable PCIe
  ...
2021-07-10 09:33:54 -07:00
Linus Torvalds
6e207b8821 Merge tag 'arm-soc-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC updates from Olof Johansson:
 "A few SoC (code) changes have queued up this cycle, mostly for minor
  changes and some refactoring and cleanup of legacy platforms. This
  branch also contains a few of the fixes that weren't sent in by the
  end of the release (all fairly minor).

   - Adding an additional maintainer for the TEE subsystem (Sumit Garg)

   - Quite a significant modernization of the IXP4xx platforms by Linus
     Walleij, revisiting with a new PCI host driver/binding, removing
     legacy mach/* include dependencies and moving platform
     detection/config to drivers/soc. Also some updates/cleanup of
     platform data.

   - Core power domain support for Tegra platforms, and some
     improvements in build test coverage by adding stubs for compile
     test targets.

   - A handful of updates to i.MX platforms, adding legacy (non-PSCI)
     SMP support on i.MX7D, SoC ID setup for i.MX50, removal of platform
     data and board fixups for iMX6/7.

  ... and a few smaller changes and fixes for Samsung, OMAP, Allwinner,
  Rockchip"

* tag 'arm-soc-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (53 commits)
  MAINTAINERS: Add myself as TEE subsystem reviewer
  ixp4xx: fix spelling mistake in Kconfig "Devce" -> "Device"
  hw_random: ixp4xx: Add OF support
  hw_random: ixp4xx: Add DT bindings
  hw_random: ixp4xx: Turn into a module
  hw_random: ixp4xx: Use SPDX license tag
  hw_random: ixp4xx: enable compile-testing
  pata: ixp4xx: split platform data to its own header
  soc: ixp4xx: move cpu detection to linux/soc/ixp4xx/cpu.h
  PCI: ixp4xx: Add a new driver for IXP4xx
  PCI: ixp4xx: Add device tree bindings for IXP4xx
  ARM/ixp4xx: Make NEED_MACH_IO_H optional
  ARM/ixp4xx: Move the virtual IObases
  MAINTAINERS: ARM/MStar/Sigmastar SoCs: Add a link to the MStar tree
  ARM: debug: add UART early console support for MSTAR SoCs
  ARM: dts: ux500: Fix LED probing
  ARM: imx: add smp support for imx7d
  ARM: imx6q: drop of_platform_default_populate() from init_machine
  arm64: dts: rockchip: Update RK3399 PCI host bridge window to 32-bit address memory
  soc/tegra: fuse: Fix Tegra234-only builds
  ...
2021-07-10 09:22:44 -07:00
Jianguo Wu
6787b7e350 mptcp: avoid processing packet if a subflow reset
If check_fully_established() causes a subflow reset, it should not
continue to process the packet in tcp_data_queue().
Add a return value to mptcp_incoming_options(), and return false if a
subflow has been reset, else return true. Then drop the packet in
tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options()
return false.

Fixes: d582484726 ("mptcp: fix fallback for MP_JOIN subflows")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 18:38:53 -07:00
David S. Miller
5d52c906f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-07-09

The following pull-request contains BPF updates for your *net* tree.

We've added 9 non-merge commits during the last 9 day(s) which contain
a total of 13 files changed, 118 insertions(+), 62 deletions(-).

The main changes are:

1) Fix runqslower task->state access from BPF, from SanjayKumar Jeyakumar.

2) Fix subprog poke descriptor tracking use-after-free, from John Fastabend.

3) Fix sparse complaint from prior devmap RCU conversion, from Toke Høiland-Jørgensen.

4) Fix missing va_end in bpftool JIT json dump's error path, from Gu Shengxian.

5) Fix tools/bpf install target from missing runqslower install, from Wei Li.

6) Fix xdpsock BPF sample to unload program on shared umem option, from Wang Hai.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 15:22:45 -07:00
Taehee Yoo
67a9c94317 net: validate lwtstate->data before returning from skb_tunnel_info()
skb_tunnel_info() returns pointer of lwtstate->data as ip_tunnel_info
type without validation. lwtstate->data can have various types such as
mpls_iptunnel_encap, etc and these are not compatible.
So skb_tunnel_info() should validate before returning that pointer.

Splat looks like:
BUG: KASAN: slab-out-of-bounds in vxlan_get_route+0x418/0x4b0 [vxlan]
Read of size 2 at addr ffff888106ec2698 by task ping/811

CPU: 1 PID: 811 Comm: ping Not tainted 5.13.0+ #1195
Call Trace:
 dump_stack_lvl+0x56/0x7b
 print_address_description.constprop.8.cold.13+0x13/0x2ee
 ? vxlan_get_route+0x418/0x4b0 [vxlan]
 ? vxlan_get_route+0x418/0x4b0 [vxlan]
 kasan_report.cold.14+0x83/0xdf
 ? vxlan_get_route+0x418/0x4b0 [vxlan]
 vxlan_get_route+0x418/0x4b0 [vxlan]
 [ ... ]
 vxlan_xmit_one+0x148b/0x32b0 [vxlan]
 [ ... ]
 vxlan_xmit+0x25c5/0x4780 [vxlan]
 [ ... ]
 dev_hard_start_xmit+0x1ae/0x6e0
 __dev_queue_xmit+0x1f39/0x31a0
 [ ... ]
 neigh_xmit+0x2f9/0x940
 mpls_xmit+0x911/0x1600 [mpls_iptunnel]
 lwtunnel_xmit+0x18f/0x450
 ip_finish_output2+0x867/0x2040
 [ ... ]

Fixes: 61adedf3e3 ("route: move lwtunnel state to dst_entry")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 13:55:53 -07:00
Linus Torvalds
a022f7d575 Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
 "A combination of changes that ended up depending on both the driver
  and core branch (and/or the IDE removal), and a few late arriving
  fixes. In detail:

   - Fix io ticks wrap-around issue (Chunguang)

   - nvme-tcp sock locking fix (Maurizio)

   - s390-dasd fixes (Kees, Christoph)

   - blk_execute_rq polling support (Keith)

   - blk-cgroup RCU iteration fix (Yu)

   - nbd backend ID addition (Prasanna)

   - Partition deletion fix (Yufen)

   - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)

   - Removal of now dead block request types due to IDE removal
     (Christoph)

   - Loop probing and control device cleanups (Christoph)

   - Device uevent fix (Christoph)

   - Misc cleanups/fixes (Tetsuo, Christoph)"

* tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
  blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
  block: fix the problem of io_ticks becoming smaller
  nvme-tcp: can't set sk_user_data without write_lock
  loop: remove unused variable in loop_set_status()
  block: remove the bdgrab in blk_drop_partitions
  block: grab a device refcount in disk_uevent
  s390/dasd: Avoid field over-reading memcpy()
  dasd: unexport dasd_set_target_state
  block: check disk exist before trying to add partition
  ubd: remove dead code in ubd_setup_common
  nvme: use return value from blk_execute_rq()
  block: return errors from blk_execute_rq()
  nvme: use blk_execute_rq() for passthrough commands
  block: support polling through blk_execute_rq
  block: remove REQ_OP_SCSI_{IN,OUT}
  block: mark blk_mq_init_queue_data static
  loop: rewrite loop_exit using idr_for_each_entry
  loop: split loop_lookup
  loop: don't allow deleting an unspecified loop device
  loop: move loop_ctl_mutex locking into loop_add
  ...
2021-07-09 12:05:33 -07:00
Linus Torvalds
1eb8df1867 Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio,vhost,vdpa updates from Michael Tsirkin:

 - Doorbell remapping for ifcvf, mlx5

 - virtio_vdpa support for mlx5

 - Validate device input in several drivers (for SEV and friends)

 - ZONE_MOVABLE aware handling in virtio-mem

 - Misc fixes, cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits)
  virtio-mem: prioritize unplug from ZONE_MOVABLE in Big Block Mode
  virtio-mem: simplify high-level unplug handling in Big Block Mode
  virtio-mem: prioritize unplug from ZONE_MOVABLE in Sub Block Mode
  virtio-mem: simplify high-level unplug handling in Sub Block Mode
  virtio-mem: simplify high-level plug handling in Sub Block Mode
  virtio-mem: use page_zonenum() in virtio_mem_fake_offline()
  virtio-mem: don't read big block size in Sub Block Mode
  virtio/vdpa: clear the virtqueue state during probe
  vp_vdpa: allow set vq state to initial state after reset
  virtio-pci library: introduce vp_modern_get_driver_features()
  vdpa: support packed virtqueue for set/get_vq_state()
  virtio-ring: store DMA metadata in desc_extra for split virtqueue
  virtio: use err label in __vring_new_virtqueue()
  virtio_ring: introduce virtqueue_desc_add_split()
  virtio_ring: secure handling of mapping errors
  virtio-ring: factor out desc_extra allocation
  virtio_ring: rename vring_desc_extra_packed
  virtio-ring: maintain next in extra state for packed virtqueue
  vdpa/mlx5: Clear vq ready indication upon device reset
  vdpa/mlx5: Add support for doorbell bypassing
  ...
2021-07-09 11:06:29 -07:00
Linus Torvalds
d8dc121eea Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:

 - Regression fix in drbg due to missing self-test for new default
   algorithm

 - Add ratelimit on user-triggerable message in qat

 - Fix build failure due to missing dependency in sl3516

 - Remove obsolete PageSlab checks

 - Fix bogus hardware register writes on Kunpeng920 in hisilicon/sec

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: hisilicon/sec - fix the process of disabling sva prefetching
  crypto: sl3516 - Add dependency on ARCH_GEMINI
  crypto: sl3516 - Typo s/Stormlink/Storlink/
  crypto: drbg - self test for HMAC(SHA-512)
  crypto: omap - Drop obsolete PageSlab check
  crypto: scatterwalk - Remove obsolete PageSlab check
  crypto: qat - ratelimit invalid ioctl message and print the invalid cmd
2021-07-09 11:00:44 -07:00
Linus Torvalds
dcf3c935dd Merge tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:

 - Support for optimized routines based on the host CPU

 - Support for PCI via virtio

 - Various fixes

* tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: remove unneeded semicolon in um_arch.c
  um: Remove the repeated declaration
  um: fix error return code in winch_tramp()
  um: fix error return code in slip_open()
  um: Fix stack pointer alignment
  um: implement flush_cache_vmap/flush_cache_vunmap
  um: add a UML specific futex implementation
  um: enable the use of optimized xor routines in UML
  um: Add support for host CPU flags and alignment
  um: allow not setting extra rpaths in the linux binary
  um: virtio/pci: enable suspend/resume
  um: add PCI over virtio emulation driver
  um: irqs: allow invoking time-travel handler multiple times
  um: time-travel/signals: fix ndelay() in interrupt
  um: expose time-travel mode to userspace side
  um: export signals_enabled directly
  um: remove unused smp_sigio_handler() declaration
  lib: add iomem emulation (logic_iomem)
  um: allow disabling NO_IOMEM
2021-07-09 10:19:13 -07:00
Linus Torvalds
e49d68ce7c Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
 "Ext4 regression and bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: inline jbd2_journal_[un]register_shrinker()
  ext4: fix flags validity checking for EXT4_IOC_CHECKPOINT
  ext4: fix possible UAF when remounting r/o a mmp-protected file system
  ext4: use ext4_grp_locked_error in mb_find_extent
  ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblock
  Revert "ext4: consolidate checks for resize of bigalloc into ext4_resize_begin"
2021-07-09 09:57:27 -07:00
Linus Torvalds
96890bc2ea Merge tag 'nfs-for-5.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - Multiple patches to add support for fcntl() leases over NFSv4.

   - A sysfs interface to display more information about the various
     transport connections used by the RPC client

   - A sysfs interface to allow a suitably privileged user to offline a
     transport that may no longer point to a valid server

   - A sysfs interface to allow a suitably privileged user to change the
     server IP address used by the RPC client

  Stable fixes:

   - Two sunrpc fixes for deadlocks involving privileged rpc_wait_queues

  Bugfixes:

   - SUNRPC: Avoid a KASAN slab-out-of-bounds bug in xdr_set_page_base()

   - SUNRPC: prevent port reuse on transports which don't request it.

   - NFSv3: Fix memory leak in posix_acl_create()

   - NFS: Various fixes to attribute revalidation timeouts

   - NFSv4: Fix handling of non-atomic change attribute updates

   - NFSv4: If a server is down, don't cause mounts to other servers to
     hang as well

   - pNFS: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT

   - NFS: Fix mount failures due to incorrect setting of the
     has_sec_mnt_opts filesystem flag

   - NFS: Ensure nfs_readpage returns promptly when an internal error
     occurs

   - NFS: Fix fscache read from NFS after cache error

   - pNFS: Various bugfixes around the LAYOUTGET operation"

* tag 'nfs-for-5.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
  NFSv4/pNFS: Return an error if _nfs4_pnfs_v3_ds_connect can't load NFSv3
  NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
  NFSv4/pnfs: Clean up layout get on open
  NFSv4/pnfs: Fix layoutget behaviour after invalidation
  NFSv4/pnfs: Fix the layout barrier update
  NFS: Fix fscache read from NFS after cache error
  NFS: Ensure nfs_readpage returns promptly when internal error occurs
  sunrpc: remove an offlined xprt using sysfs
  sunrpc: provide showing transport's state info in the sysfs directory
  sunrpc: display xprt's queuelen of assigned tasks via sysfs
  sunrpc: provide multipath info in the sysfs directory
  NFSv4.1 identify and mark RPC tasks that can move between transports
  sunrpc: provide transport info in the sysfs directory
  SUNRPC: take a xprt offline using sysfs
  sunrpc: add dst_attr attributes to the sysfs xprt directory
  SUNRPC for TCP display xprt's source port in sysfs xprt_info
  SUNRPC query transport's source port
  SUNRPC display xprt's main value in sysfs's xprt_info
  SUNRPC mark the first transport
  sunrpc: add add sysfs directory per xprt under each xprt_switch
  ...
2021-07-09 09:43:57 -07:00
Linus Torvalds
227c4d507c Merge tag 'f2fs-for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've improved the compression support especially for
  Android such as allowing compression for mmap files, replacing the
  immutable bit with internal bit to prohibits data writes explicitly,
  and adding a mount option, "compress_cache", to improve random reads.
  And, we added "readonly" feature to compact the partition w/
  compression enabled, which will be useful for Android RO partitions.

  Enhancements:
   - support compression for mmap file
   - use an f2fs flag instead of IMMUTABLE bit for compression
   - support RO feature w/ extent_cache
   - fully support swapfile with file pinning
   - improve atgc tunability
   - add nocompress extensions to unselect files for compression

  Bug fixes:
   - fix false alaram on iget failure during GC
   - fix race condition on global pointers when there are multiple f2fs
     instances
   - add MODULE_SOFTDEP for initramfs

  As usual, we've also cleaned up some places for better code
  readability (e.g., sysfs/feature, debugging messages, slab cache
  name, and docs)"

* tag 'f2fs-for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (32 commits)
  f2fs: drop dirty node pages when cp is in error status
  f2fs: initialize page->private when using for our internal use
  f2fs: compress: add nocompress extensions support
  MAINTAINERS: f2fs: update my email address
  f2fs: remove false alarm on iget failure during GC
  f2fs: enable extent cache for compression files in read-only
  f2fs: fix to avoid adding tab before doc section
  f2fs: introduce f2fs_casefolded_name slab cache
  f2fs: swap: support migrating swapfile in aligned write mode
  f2fs: swap: remove dead codes
  f2fs: compress: add compress_inode to cache compressed blocks
  f2fs: clean up /sys/fs/f2fs/<disk>/features
  f2fs: add pin_file in feature list
  f2fs: Advertise encrypted casefolding in sysfs
  f2fs: Show casefolding support only when supported
  f2fs: support RO feature
  f2fs: logging neatening
  f2fs: introduce FI_COMPRESS_RELEASED instead of using IMMUTABLE bit
  f2fs: compress: remove unneeded preallocation
  f2fs: atgc: export entries for better tunability via sysfs
  ...
2021-07-09 09:37:56 -07:00
Linus Torvalds
bd9c350603 Merge branch 'akpm' (patches from Andrew)
Pull yet more updates from Andrew Morton:
 "54 patches.

  Subsystems affected by this patch series: lib, mm (slub, secretmem,
  cleanups, init, pagemap, and mremap), and debug"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (54 commits)
  powerpc/mm: enable HAVE_MOVE_PMD support
  powerpc/book3s64/mm: update flush_tlb_range to flush page walk cache
  mm/mremap: allow arch runtime override
  mm/mremap: hold the rmap lock in write mode when moving page table entries.
  mm/mremap: use pmd/pud_poplulate to update page table entries
  mm/mremap: don't enable optimized PUD move if page table levels is 2
  mm/mremap: convert huge PUD move to separate helper
  selftest/mremap_test: avoid crash with static build
  selftest/mremap_test: update the test to handle pagesize other than 4K
  mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *
  mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *
  kdump: use vmlinux_build_id to simplify
  buildid: fix kernel-doc notation
  buildid: mark some arguments const
  scripts/decode_stacktrace.sh: indicate 'auto' can be used for base path
  scripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm
  scripts/decode_stacktrace.sh: support debuginfod
  x86/dumpstack: use %pSb/%pBb for backtrace printing
  arm64: stacktrace: use %pSb for backtrace printing
  module: add printk formats to add module build ID to stacktraces
  ...
2021-07-09 09:29:13 -07:00
John Fastabend
f263a81451 bpf: Track subprog poke descriptors correctly and fix use-after-free
Subprograms are calling map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). However, on program release, the aux memory
(and poke descriptor table) is freed even though we still have a reference to
it in the element list of the map aux data. When we run map_poke_run(), we then
end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():

  [...]
  [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
  [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
  [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
  [  402.824715] Call Trace:
  [  402.824719]  dump_stack+0x93/0xc2
  [  402.824727]  print_address_description.constprop.0+0x1a/0x140
  [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824744]  kasan_report.cold+0x7c/0xd8
  [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
  [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
  [...]

The elements concerned are walked as follows:

    for (i = 0; i < elem->aux->size_poke_tab; i++) {
           poke = &elem->aux->poke_tab[i];
    [...]

The access to size_poke_tab is a 4 byte read, verified by checking offsets
in the KASAN dump:

  [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                 which belongs to the cache kmalloc-1k of size 1024
  [  402.825008] The buggy address is located 320 bytes inside of
                 1024-byte region [ffff8881905a7800, ffff8881905a7c00)

The pahole output of bpf_prog_aux:

  struct bpf_prog_aux {
    [...]
    /* --- cacheline 5 boundary (320 bytes) --- */
    u32                        size_poke_tab;        /*   320     4 */
    [...]

In general, subprograms do not necessarily manage their own data structures.
For example, BTF func_info and linfo are just pointers to the main program
structure. This allows reference counting and cleanup to be done on the latter
which simplifies their management a bit. The aux->poke_tab struct, however,
did not follow this logic. The initial proposed fix for this use-after-free
bug further embedded poke data tracking into the subprogram with proper
reference counting. However, Daniel and Alexei questioned why we were treating
these objects special; I agree, its unnecessary. The fix here removes the per
subprogram poke table allocation and map tracking and instead simply points
the aux->poke_tab pointer at the main programs poke table. This way, map
tracking is simplified to the main program and we do not need to manage them
per subprogram.

This also means, bpf_prog_free_deferred(), which unwinds the program reference
counting and kfrees objects, needs to ensure that we don't try to double free
the poke_tab when free'ing the subprog structures. This is easily solved by
NULL'ing the poke_tab pointer. The second detail is to ensure that per
subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
this, we add a pointer in the poke structure to point at the subprogram value
so JITs can easily check while walking the poke_tab structure if the current
entry belongs to the current program. The aux pointer is stable and therefore
suitable for such comparison. On the jit_subprogs() error path, we omit
cleaning up the poke->aux field because these are only ever referenced from
the JIT side, but on error we will never make it to the JIT, so its fine to
leave them dangling. Removing these pointers would complicate the error path
for no reason. However, we do need to untrack all poke descriptors from the
main program as otherwise they could race with the freeing of JIT memory from
the subprograms. Lastly, a748c6975d ("bpf: propagate poke descriptors to
subprograms") had an off-by-one on the subprogram instruction index range
check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
However, subprog_end is the next subprogram's start instruction.

Fixes: a748c6975d ("bpf: propagate poke descriptors to subprograms")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
2021-07-09 12:08:27 +02:00
Linus Torvalds
f55966571d Merge tag 'drm-next-2021-07-08-1' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
 "Some fixes for rc1 that came in the past weeks, mainly a bunch of
  amdgpu fixes, some i915 and the rest are misc around the place. I'm
  sending this a bit early so some more stuff may show up, but I'll
  probably take tomorrow off.

  dma-buf:
   - doc fixes

  amdgpu:
   - Misc Navi fixes
   - Powergating fix
   - Yellow Carp updates
   - Beige Goby updates
   - S0ix fix
   - Revert overlay validation fix
   - GPU reset fix for DC
   - PPC64 fix
   - Add new dimgrey cavefish DID
   - RAS fix
   - TTM fixes

  amdkfd:
   - SVM fixes

  radeon:
   - Fix missing drm_gem_object_put in error path
   - NULL ptr deref fix

  i915:
   - display DP VSC fix
   - DG1 display fix
   - IRQ fixes
   - IRQ demidlayering

  gma500:
   - bo leaks in error paths fixed"

* tag 'drm-next-2021-07-08-1' of git://anongit.freedesktop.org/drm/drm: (52 commits)
  drm/i915: Drop all references to DRM IRQ midlayer
  drm/i915: Use the correct IRQ during resume
  drm/i915/display/dg1: Correctly map DPLLs during state readout
  drm/i915/display: Do not zero past infoframes.vsc
  drm/amdgpu: Conditionally reset SDMA RAS error counts
  drm/amdkfd: Maintain svm_bo reference in page->zone_device_data
  drm/amdkfd: add invalid pages debug at vram migration
  drm/amdkfd: skip migration for pages already in VRAM
  drm/amdkfd: skip invalid pages during migrations
  drm/amdkfd: classify and map mixed svm range pages in GPU
  drm/amdkfd: use hmm range fault to get both domain pfns
  drm/amdgpu: get owner ref in validate and map
  drm/amdkfd: set owner ref to svm range prefault
  drm/amdkfd: add owner ref param to get hmm pages
  drm/amdkfd: device pgmap owner at the svm migrate init
  drm/amdkfd: inc counter on child ranges with xnack off
  drm/amd/display: Extend DMUB diagnostic logging to DCN3.1
  drm/amdgpu: Update NV SIMD-per-CU to 2
  drm/amdgpu: add new dimgrey cavefish DID
  drm/amd/pm: skip PrepareMp1ForUnload message in s0ix
  ...
2021-07-08 12:28:15 -07:00
Linus Torvalds
8c1bfd7460 Merge tag 'pwm/for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
 "This contains mostly various fixes, cleanups and some conversions to
  the atomic API. One noteworthy change is that PWM consumers can now
  pass a hint to the PWM core about the PWM usage, enabling PWM
  providers to implement various optimizations.

  There's also a fair bit of simplification here with the addition of
  some device-managed helpers as well as unification between the DT and
  ACPI firmware interfaces"

* tag 'pwm/for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (50 commits)
  pwm: Remove redundant assignment to pointer pwm
  pwm: ep93xx: Fix read of uninitialized variable ret
  pwm: ep93xx: Prepare clock before using it
  pwm: ep93xx: Unfold legacy callbacks into ep93xx_pwm_apply()
  pwm: ep93xx: Implement .apply callback
  pwm: vt8500: Only unprepare the clock after the pwmchip was removed
  pwm: vt8500: Drop if with an always false condition
  pwm: tegra: Assert reset only after the PWM was unregistered
  pwm: tegra: Don't needlessly enable and disable the clock in .remove()
  pwm: tegra: Don't modify HW state in .remove callback
  pwm: tegra: Drop an if block with an always false condition
  pwm: core: Simplify some devm_*pwm*() functions
  pwm: core: Remove unused devm_pwm_put()
  pwm: core: Unify fwnode checks in the module
  pwm: core: Reuse fwnode_to_pwmchip() in ACPI case
  pwm: core: Convert to use fwnode for matching
  docs: firmware-guide: ACPI: Add a PWM example
  dt-bindings: pwm: pwm-tiecap: Add compatible string for AM64 SoC
  dt-bindings: pwm: pwm-tiecap: Convert to json schema
  pwm: sprd: Don't check the return code of pwmchip_remove()
  ...
2021-07-08 12:18:04 -07:00
Linus Torvalds
b0dfd9af28 Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull more clk updates from Stephen Boyd:

 - A handful of fixes for lmk04832 driver

 - Migrate the basic clk divider to use determine rate ops

 - Fix modpost build for hisilicon hi3559a driver

 - Actually set the parent in k210_clk_set_parent()

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  Revert "clk: divider: Switch from .round_rate to .determine_rate by default"
  clk: hisilicon: hi3559a: Drop __init markings everywhere
  clk: meson: regmap: switch to determine_rate for the dividers
  clk: divider: Switch from .round_rate to .determine_rate by default
  clk: divider: Add re-usable determine_rate implementations
  clk: k210: Fix k210_clk_set_parent()
  clk: lmk04832: Fix spelling mistakes in dev_err messages and comments
  clk: lmk04832: fix return value check in lmk04832_probe()
  clk: stm32mp1: fix missing spin_lock_init()
2021-07-08 12:12:22 -07:00
Linus Torvalds
316a2c9b6a Merge tag 'pci-v5.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
 "Enumeration:
   - Fix dsm_label_utf16s_to_utf8s() buffer overrun (Krzysztof
     Wilczyński)
   - Rely on lengths from scnprintf(), dsm_label_utf16s_to_utf8s()
     (Krzysztof Wilczyński)
   - Use sysfs_emit() and sysfs_emit_at() in "show" functions (Krzysztof
     Wilczyński)
   - Fix 'resource_alignment' newline issues (Krzysztof Wilczyński)
   - Add 'devspec' newline (Krzysztof Wilczyński)
   - Dynamically map ECAM regions (Russell King)

  Resource management:
   - Coalesce host bridge contiguous apertures (Kai-Heng Feng)

  PCIe native device hotplug:
   - Ignore Link Down/Up caused by DPC (Lukas Wunner)

  Power management:
   - Leave Apple Thunderbolt controllers on for s2idle or standby
     (Konstantin Kharlamov)

  Virtualization:
   - Work around Huawei Intelligent NIC VF FLR erratum (Chiqijun)
   - Clarify error message for unbound IOV devices (Moritz Fischer)
   - Add pci_reset_bus_function() Secondary Bus Reset interface (Raphael
     Norwitz)

  Peer-to-peer DMA:
   - Simplify distance calculation (Christoph Hellwig)
   - Finish RCU conversion of pdev->p2pdma (Eric Dumazet)
   - Rename upstream_bridge_distance() and rework doc (Logan Gunthorpe)
   - Collect acs list in stack buffer to avoid sleeping (Logan
     Gunthorpe)
   - Use correct calc_map_type_and_dist() return type (Logan Gunthorpe)
   - Warn if host bridge not in whitelist (Logan Gunthorpe)
   - Refactor pci_p2pdma_map_type() (Logan Gunthorpe)
   - Avoid pci_get_slot(), which may sleep (Logan Gunthorpe)

  Altera PCIe controller driver:
   - Add Joyce Ooi as Altera PCIe maintainer (Joyce Ooi)

  Broadcom iProc PCIe controller driver:
   - Fix multi-MSI base vector number allocation (Sandor Bodo-Merle)
   - Support multi-MSI only on uniprocessor kernel (Sandor Bodo-Merle)

  Freescale i.MX6 PCIe controller driver:
   - Limit DBI register length for imx6qp PCIe (Richard Zhu)
   - Add "vph-supply" for PHY supply voltage (Richard Zhu)
   - Enable PHY internal regulator when supplied >3V (Richard Zhu)
   - Remove imx6_pcie_probe() redundant error message (Zhen Lei)

  Intel Gateway PCIe controller driver:
   - Fix INTx enable (Martin Blumenstingl)

  Marvell Aardvark PCIe controller driver:
   - Fix checking for PIO Non-posted Request (Pali Rohár)
   - Implement workaround for the readback value of VEND_ID (Pali Rohár)

  MediaTek PCIe controller driver:
   - Remove redundant error printing in mtk_pcie_subsys_powerup() (Zhen
     Lei)

  MediaTek PCIe Gen3 controller driver:
   - Add missing MODULE_DEVICE_TABLE (Zou Wei)

  Microchip PolarFlare PCIe controller driver:
   - Make struct event_descs static (Krzysztof Wilczyński)

  Microsoft Hyper-V host bridge driver:
   - Fix race condition when removing the device (Long Li)
   - Remove bus device removal unused refcount/functions (Long Li)

  Mobiveil PCIe controller driver:
   - Remove unused readl and writel functions (Krzysztof Wilczyński)

  NVIDIA Tegra PCIe controller driver:
   - Add missing MODULE_DEVICE_TABLE (Zou Wei)

  NVIDIA Tegra194 PCIe controller driver:
   - Fix tegra_pcie_ep_raise_msi_irq() ill-defined shift (Jon Hunter)
   - Fix host initialization during resume (Vidya Sagar)

  Rockchip PCIe controller driver:
   - Register IRQ handlers after device and data are ready (Javier
     Martinez Canillas)"

* tag 'pci-v5.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (48 commits)
  PCI/P2PDMA: Finish RCU conversion of pdev->p2pdma
  PCI: xgene: Annotate __iomem pointer
  PCI: Fix kernel-doc formatting
  PCI: cpcihp: Declare cpci_debug in header file
  MAINTAINERS: Add Joyce Ooi as Altera PCIe maintainer
  PCI: rockchip: Register IRQ handlers after device and data are ready
  PCI: tegra194: Fix tegra_pcie_ep_raise_msi_irq() ill-defined shift
  PCI: aardvark: Implement workaround for the readback value of VEND_ID
  PCI: aardvark: Fix checking for PIO Non-posted Request
  PCI: tegra194: Fix host initialization during resume
  PCI: tegra: Add missing MODULE_DEVICE_TABLE
  PCI: imx6: Enable PHY internal regulator when supplied >3V
  dt-bindings: imx6q-pcie: Add "vph-supply" for PHY supply voltage
  PCI: imx6: Limit DBI register length for imx6qp PCIe
  PCI: imx6: Remove imx6_pcie_probe() redundant error message
  PCI: intel-gw: Fix INTx enable
  PCI: iproc: Support multi-MSI only on uniprocessor kernel
  PCI: iproc: Fix multi-MSI base vector number allocation
  PCI: mediatek-gen3: Add missing MODULE_DEVICE_TABLE
  PCI: Dynamically map ECAM regions
  ...
2021-07-08 12:06:20 -07:00
Aneesh Kumar K.V
dc4875f0e7 mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *
No functional change in this patch.

[aneesh.kumar@linux.ibm.com: m68k build error reported by kernel robot]
  Link: https://lkml.kernel.org/r/87tulxnb2v.fsf@linux.ibm.com

Link: https://lkml.kernel.org/r/20210615110859.320299-2-aneesh.kumar@linux.ibm.com
Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:22 -07:00
Aneesh Kumar K.V
9cf6fa2458 mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *
No functional change in this patch.

[aneesh.kumar@linux.ibm.com: fix]
  Link: https://lkml.kernel.org/r/87wnqtnb60.fsf@linux.ibm.com
[sfr@canb.auug.org.au: another fix]
  Link: https://lkml.kernel.org/r/20210619134410.89559-1-aneesh.kumar@linux.ibm.com

Link: https://lkml.kernel.org/r/20210615110859.320299-1-aneesh.kumar@linux.ibm.com
Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:22 -07:00
Stephen Boyd
44e8a5e912 kdump: use vmlinux_build_id to simplify
We can use the vmlinux_build_id array here now instead of open coding it.
This mostly consolidates code.

Link: https://lkml.kernel.org/r/20210511003845.2429846-14-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:22 -07:00
Stephen Boyd
9294523e37 module: add printk formats to add module build ID to stacktraces
Let's make kernel stacktraces easier to identify by including the build
ID[1] of a module if the stacktrace is printing a symbol from a module.
This makes it simpler for developers to locate a kernel module's full
debuginfo for a particular stacktrace.  Combined with
scripts/decode_stracktrace.sh, a developer can download the matching
debuginfo from a debuginfod[2] server and find the exact file and line
number for the functions plus offsets in a stacktrace that match the
module.  This is especially useful for pstore crash debugging where the
kernel crashes are recorded in something like console-ramoops and the
recovery kernel/modules are different or the debuginfo doesn't exist on
the device due to space concerns (the debuginfo can be too large for space
limited devices).

Originally, I put this on the %pS format, but that was quickly rejected
given that %pS is used in other places such as ftrace where build IDs
aren't meaningful.  There was some discussions on the list to put every
module build ID into the "Modules linked in:" section of the stacktrace
message but that quickly becomes very hard to read once you have more than
three or four modules linked in.  It also provides too much information
when we don't expect each module to be traversed in a stacktrace.  Having
the build ID for modules that aren't important just makes things messy.
Splitting it to multiple lines for each module quickly explodes the number
of lines printed in an oops too, possibly wrapping the warning off the
console.  And finally, trying to stash away each module used in a
callstack to provide the ID of each symbol printed is cumbersome and would
require changes to each architecture to stash away modules and return
their build IDs once unwinding has completed.

Instead, we opt for the simpler approach of introducing new printk formats
'%pS[R]b' for "pointer symbolic backtrace with module build ID" and '%pBb'
for "pointer backtrace with module build ID" and then updating the few
places in the architecture layer where the stacktrace is printed to use
this new format.

Before:

 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm]
  direct_entry+0x16c/0x1b4 [lkdtm]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8

After:

 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
  direct_entry+0x16c/0x1b4 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8

[akpm@linux-foundation.org: fix build with CONFIG_MODULES=n, tweak code layout]
[rdunlap@infradead.org: fix build when CONFIG_MODULES is not set]
  Link: https://lkml.kernel.org/r/20210513171510.20328-1-rdunlap@infradead.org
[akpm@linux-foundation.org: make kallsyms_lookup_buildid() static]
[cuibixuan@huawei.com: fix build error when CONFIG_SYSFS is disabled]
  Link: https://lkml.kernel.org/r/20210525105049.34804-1-cuibixuan@huawei.com

Link: https://lkml.kernel.org/r/20210511003845.2429846-6-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1]
Link: https://sourceware.org/elfutils/Debuginfod.html [2]
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:22 -07:00
Stephen Boyd
22f4e66df7 dump_stack: add vmlinux build ID to stack traces
Add the running kernel's build ID[1] to the stacktrace information header.
This makes it simpler for developers to locate the vmlinux with full
debuginfo for a particular kernel stacktrace.  Combined with
scripts/decode_stracktrace.sh, a developer can download the correct
vmlinux from a debuginfod[2] server and find the exact file and line
number for the functions plus offsets in a stacktrace.

This is especially useful for pstore crash debugging where the kernel
crashes are recorded in the pstore logs and the recovery kernel is
different or the debuginfo doesn't exist on the device due to space
concerns (the data can be large and a security concern).  The stacktrace
can be analyzed after the crash by using the build ID to find the matching
vmlinux and understand where in the function something went wrong.

Example stacktrace from lkdtm:

 WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 lkdtm_WARNING+0x28/0x30 [lkdtm]
 Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup uinput xt_MASQUERADE
 CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 aa23f7a1231c229de205662d5a9e0d4c580f19a1
 Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
 pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
 pc : lkdtm_WARNING+0x28/0x30 [lkdtm]

The hex string aa23f7a1231c229de205662d5a9e0d4c580f19a1 is the build ID,
following the kernel version number. Put it all behind a config option,
STACKTRACE_BUILD_ID, so that kernel developers can remove this
information if they decide it is too much.

Link: https://lkml.kernel.org/r/20210511003845.2429846-5-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1]
Link: https://sourceware.org/elfutils/Debuginfod.html [2]
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:22 -07:00