Change the proc_create[_data]() stubs which are used when CONFIG_PROC_FS
is not set from #defines to a static inline stubs.
This should fix clang -Werror builds failing due to errors like this:
drivers/platform/x86/thinkpad_acpi.c:918:30: error: unused variable
'dispatch_proc_ops' [-Werror,-Wunused-const-variable]
Fixing this in include/linux/proc_fs.h should ensure that the same issue
is also fixed in any other drivers hitting the same -Werror issue.
[akpm@linux-foundation.org: fix CONFIG_PROC_FS=n]
[akpm@linux-foundation.org: fix arch/sparc/kernel/led.c]
[akpm@linux-foundation.org: fix build]
Link: https://lkml.kernel.org/r/20211116131112.508304-1-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit cc5f2704c9 ("proc/vmcore: convert oldmem_pfn_is_ram callback
to more generic vmcore callbacks"), we added detection of surprise
vmcore_cb unregistration after the vmcore was already opened. Once
detected, we warn the user and simulate reading zeroes from that point
on when accessing the vmcore.
The basic reason was that unexpected unregistration, for example, by
manually unbinding a driver from a device after opening the vmcore, is
not supported and could result in reading oldmem the vmcore_cb would
have actually prohibited while registered. However, something like that
can similarly be trigger by a user that's really looking for trouble
simply by unbinding the relevant driver before opening the vmcore -- or
by disallowing loading the driver in the first place. So it's actually
of limited help.
Currently, unregistration can only be triggered via virtio-mem when
manually unbinding the driver from the device inside the VM; there is no
way to trigger it from the hypervisor, as hypervisors don't allow for
unplugging virtio-mem devices -- ripping out system RAM from a VM
without coordination with the guest is usually not a good idea.
The important part is that unbinding the driver and unregistering the
vmcore_cb while concurrently reading the vmcore won't crash the system,
and that is handled by the rwsem.
To make the mechanism more future proof, let's remove the "read zero"
part, but leave the warning in place. For example, we could have a
future driver (like virtio-balloon) that will contact the hypervisor to
figure out if we already populated a page for a given PFN.
Hotunplugging such a device and consequently unregistering the vmcore_cb
could be triggered from the hypervisor without harming the system even
while kdump is running. In that case, we don't want to silently end up
with a vmcore that contains wrong data, because the user inside the VM
might be unaware of the hypervisor action and might easily miss the
warning in the log.
Link: https://lkml.kernel.org/r/20211111192243.22002-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Philipp Rudo <prudo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: percpu: Cleanup percpu first chunk function".
When supporting page mapping percpu first chunk allocator on arm64, we
found there are lots of duplicated codes in percpu embed/page first chunk
allocator. This patchset is aimed to cleanup them and should no function
change.
The currently supported status about 'embed' and 'page' in Archs shows
below,
embed: NEED_PER_CPU_PAGE_FIRST_CHUNK
page: NEED_PER_CPU_EMBED_FIRST_CHUNK
embed page
------------------------
arm64 Y Y
mips Y N
powerpc Y Y
riscv Y N
sparc Y Y
x86 Y Y
------------------------
There are two interfaces about percpu first chunk allocator,
extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
size_t atom_size,
pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn);
+ pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);
extern int __init pcpu_page_first_chunk(size_t reserved_size,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn,
- pcpu_fc_populate_pte_fn_t populate_pte_fn);
+ pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);
The pcpu_fc_alloc_fn_t/pcpu_fc_free_fn_t is killed, we provide generic
pcpu_fc_alloc() and pcpu_fc_free() function, which are called in the
pcpu_embed/page_first_chunk().
1) For pcpu_embed_first_chunk(), pcpu_fc_cpu_to_node_fn_t is needed to be
provided when archs supported NUMA.
2) For pcpu_page_first_chunk(), the pcpu_fc_populate_pte_fn_t is killed too,
a generic pcpu_populate_pte() which marked '__weak' is provided, if you
need a different function to populate pte on the arch(like x86), please
provide its own implementation.
[1] https://github.com/kevin78/linux.git percpu-cleanup
This patch (of 4):
The HAVE_SETUP_PER_CPU_AREA/NEED_PER_CPU_EMBED_FIRST_CHUNK/
NEED_PER_CPU_PAGE_FIRST_CHUNK/USE_PERCPU_NUMA_NODE_ID configs, which have
duplicate definitions on platforms that subscribe it.
Move them into mm, drop these redundant definitions and instead just
select it on applicable platforms.
Link: https://lkml.kernel.org/r/20211216112359.103822-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20211216112359.103822-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
YiFei Zhu says:
====================
Right now, most cgroup hooks are best used for permission checks. They
can only reject a syscall with -EPERM, so a cause of a rejection, if
the rejected by eBPF cgroup hooks, is ambiguous to userspace.
Additionally, if the syscalls are implemented in eBPF, all permission
checks and the implementation has to happen within the same filter,
as programs executed later in the series of progs are unaware of the
return values return by the previous progs.
This patch series adds two helpers, bpf_get_retval and bpf_set_retval,
that allows hooks to get/set the return value of syscall to userspace.
This also allows later progs to retrieve retval set by previous progs.
For legacy programs that rejects a syscall without setting the retval,
for backwards compatibility, if a prog rejects without itself or a
prior prog setting retval to an -err, the retval is set by the kernel
to -EPERM.
For getsockopt hooks that has ctx->retval, this variable mirrors that
that accessed by the helpers.
Additionally, the following user-visible behavior for getsockopt
hooks has changed:
- If a prior filter rejected the syscall, it will be visible
in ctx->retval.
- Attempting to change the retval arbitrarily is now allowed and
will not cause an -EFAULT.
- If kernel rejects a getsockopt syscall before running the hooks,
the error will be visible in ctx->retval. Returning 0 from the
prog will not overwrite the error to -EPERM unless there is an
explicit call of bpf_set_retval(-EPERM)
Tests have been added in this series to test the behavior of the helper
with cgroup setsockopt getsockopt hooks.
Patch 1 changes the API of macros to prepare for the next patch and
should be a no-op.
Patch 2 moves ctx->retval to a struct pointed to by current
task_struct.
Patch 3 implements the helpers.
Patch 4 tests the behaviors of the helpers.
Patch 5 updates a test after the test broke due to the visible changes.
v1 -> v2:
- errno -> retval
- split one helper to get & set helpers
- allow retval to be set arbitrarily in the general case
- made the helper retval and context retval mirror each other
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The tests would break without this patch, because at one point it calls
getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen)
This getsockopt receives the kernel-set -EINVAL. Prior to this patch
series, the eBPF getsockopt hook's -EPERM would override kernel's
-EINVAL, however, after this patch series, return 0's automatic
-EPERM will not; the eBPF prog has to explicitly bpf_set_retval(-EPERM)
if that is wanted.
I also removed the explicit mentions of EPERM in the comments in the
prog.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/4f20b77cb46812dbc2bdcd7e3fa87c7573bde55e.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The tests checks how different ways of interacting with the helpers
(getting retval, setting EUNATCH, EISCONN, and legacy reject
returning 0 without setting retval), produce different results in
both the setsockopt syscall and the retval returned by the helper.
A few more tests verify the interaction between the retval of the
helper and the retval in getsockopt context.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/43ec60d679ae3f4f6fd2460559c28b63cb93cd12.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The helpers continue to use int for retval because all the hooks
are int-returning rather than long-returning. The return value of
bpf_set_retval is int for future-proofing, in case in the future
there may be errors trying to set the retval.
After the previous patch, if a program rejects a syscall by
returning 0, an -EPERM will be generated no matter if the retval
is already set to -err. This patch change it being forced only if
retval is not -err. This is because we want to support, for
example, invoking bpf_set_retval(-EINVAL) and return 0, and have
the syscall return value be -EINVAL not -EPERM.
For BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY, the prior behavior is
that, if the return value is NET_XMIT_DROP, the packet is silently
dropped. We preserve this behavior for backward compatibility
reasons, so even if an errno is set, the errno does not return to
caller. However, setting a non-err to retval cannot propagate so
this is not allowed and we return a -EFAULT in that case.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/b4013fd5d16bed0b01977c1fafdeae12e1de61fb.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The retval value is moved to struct bpf_cg_run_ctx for ease of access
in different prog types with different context structs layouts. The
helper implementation (to be added in a later patch in the series) can
simply perform a container_of from current->bpf_ctx to retrieve
bpf_cg_run_ctx.
Unfortunately, there is no easy way to access the current task_struct
via the verifier BPF bytecode rewrite, aside from possibly calling a
helper, so a pointer to current task is added to struct bpf_sockopt_kern
so that the rewritten BPF bytecode can access struct bpf_cg_run_ctx with
an indirection.
For backward compatibility, if a getsockopt program rejects a syscall
by returning 0, an -EPERM will be generated, by having the
BPF_PROG_RUN_ARRAY_CG family macros automatically set the retval to
-EPERM. Unlike prior to this patch, this -EPERM will be visible to
ctx->retval for any other hooks down the line in the prog array.
Additionally, the restriction that getsockopt filters can only set
the retval to 0 is removed, considering that certain getsockopt
implementations may return optlen. Filters are now able to set the
value arbitrarily.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/73b0325f5c29912ccea7ea57ec1ed4d388fc1d37.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Right now BPF_PROG_RUN_ARRAY and related macros return 1 or 0
for whether the prog array allows or rejects whatever is being
hooked. The caller of these macros then return -EPERM or continue
processing based on thw macro's return value. Unforunately this is
inflexible, since -EPERM is the only err that can be returned.
This patch should be a no-op; it prepares for the next patch. The
returning of the -EPERM is moved to inside the macros, so the outer
functions are directly returning what the macros returned if they
are non-zero.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/788abcdca55886d1f43274c918eaa9f792a9f33b.1639619851.git.zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a hashmap to map the string offsets from a source btf to the
string offsets from a target btf to reduce overheads.
btf__add_btf() calls btf__add_str() to add strings from a source to a
target btf. It causes many string comparisons, and it is a major
hotspot when adding a big btf. btf__add_str() uses strcmp() to check
if a hash entry is the right one. The extra hashmap here compares
offsets of strings, that are much cheaper. It remembers the results
of btf__add_str() for later uses to reduce the cost.
We are parallelizing BTF encoding for pahole by creating separated btf
instances for worker threads. These per-thread btf instances will be
added to the btf instance of the main thread by calling btf__add_str()
to deduplicate and write out. With this patch and -j4, the running
time of pahole drops to about 6.0s from 6.6s.
The following lines are the summary of 'perf stat' w/o the change.
6.668126396 seconds time elapsed
13.451054000 seconds user
0.715520000 seconds sys
The following lines are the summary w/ the change.
5.986973919 seconds time elapsed
12.939903000 seconds user
0.724152000 seconds sys
V4 fixes a bug of error checking against the pointer returned by
hashmap__new().
[v3] https://lore.kernel.org/bpf/20220118232053.2113139-1-kuifeng@fb.com/
[v2] https://lore.kernel.org/bpf/20220114193713.461349-1-kuifeng@fb.com/
Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220119180214.255634-1-kuifeng@fb.com
Currently the syscalls rst and subsequently man page are auto-generated
using function documentation present in bpf.h. If the documentation for the
syscall is missing or doesn't follow a specific format, then that syscall
is not dumped in the auto-generated rst.
This patch checks the number of syscalls documented within the header file
with those present as part of the enum bpf_cmd and raises an Exception if
they don't match. It is not needed with the currently documented upstream
syscalls, but can help in debugging when developing new syscalls when
there might be missing or misformatted documentation.
The function helper_number_check is moved to the Printer parent
class and renamed to elem_number_check as all the most derived children
classes are using this function now.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20220119114442.1452088-3-usama.arif@bytedance.com
Both description and returns section will become mandatory
for helpers and syscalls in a later commit to generate man pages.
This commit also adds in the documentation that BPF_PROG_RUN is
an alias for BPF_PROG_TEST_RUN for anyone searching for the
syscall in the generated man pages.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220119114442.1452088-1-usama.arif@bytedance.com
`bpftool prog list` and other bpftool subcommands that show
BPF program names currently get them from bpf_prog_info.name.
That field is limited to 16 (BPF_OBJ_NAME_LEN) chars which leads
to truncated names since many progs have much longer names.
The idea of this change is to improve all bpftool commands that
output prog name so that bpftool uses info from BTF to print
program names if available.
It tries bpf_prog_info.name first and fall back to btf only if
the name is suspected to be truncated (has 15 chars length).
Right now `bpftool p show id <id>` returns capped prog name
<id>: kprobe name example_cap_cap tag 712e...
...
With this change it would return
<id>: kprobe name example_cap_capable tag 712e...
...
Note, other commands that print prog names (e.g. "bpftool
cgroup tree") are also addressed in this change.
Signed-off-by: Raman Shukhau <ramasha@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220119100255.1068997-1-ramasha@fb.com
Eric Dumazet says:
====================
ipv4: avoid pathological hash tables
This series speeds up netns dismantles on hosts
having many active netns, by making sure two hash tables
used for IPV4 fib contains uniformly spread items.
v2: changed second patch to add fib_info_laddrhash_bucket()
for consistency (David Ahern suggestion).
====================
Link: https://lore.kernel.org/r/20220119100413.4077866-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/ipv4/fib_semantics.c uses a hash table (fib_info_laddrhash)
in which fib_sync_down_addr() can locate fib_info
based on IPv4 local address.
This hash table is resized based on total number of
hashed fib_info, but the hash function is only
using the local address.
For hosts having many active network namespaces,
all fib_info for loopback devices (IPv4 address 127.0.0.1)
are hashed into a single bucket, making netns dismantles
very slow.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/ipv4/fib_semantics.c uses an hash table of 256 slots,
keyed by device ifindexes: fib_info_devhash[DEVINDEX_HASHSIZE]
Problem is that with network namespaces, devices tend
to use the same ifindex.
lo device for instance has a fixed ifindex of one,
for all network namespaces.
This means that hosts with thousands of netns spend
a lot of time looking at some hash buckets with thousands
of elements, notably at netns dismantle.
Simply add a per netns perturbation (net_hash_mix())
to spread elements more uniformely.
Also change fib_devindex_hashfn() to use more entropy.
Fixes: aa79e66eee ("net: Make ifindex generation per-net namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tobias Waldekranz says:
====================
net/fsl: xgmac_mdio: Add workaround for erratum A-009885
The individual messages mostly speak for themselves.
It is very possible that there are more chips out there that are
impacted by this, but I only have access to the errata document for
the T1024 family, so I've limited the DT changes to the exact FMan
version used in that device. Hopefully someone from NXP can supply a
follow-up if need be.
The final commit is an unrelated fix that was brought to my attention
by sparse.
====================
Link: https://lore.kernel.org/r/20220118215054.2629314-1-tobias@waldekranz.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As reported by sparse: In the remove path, the driver would attempt to
unmap its own priv pointer - instead of the io memory that it mapped
in probe.
Fixes: 9f35a7342c ("net/fsl: introduce Freescale 10G MDIO driver")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This block is used in (at least) T1024 and T1040, including their
variants like T1023 etc.
Fixes: d55ad2967d ("powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update FMan binding documentation with the newly added workaround for
erratum A-009885.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once an MDIO read transaction is initiated, we must read back the data
register within 16 MDC cycles after the transaction completes. Outside
of this window, reads may return corrupt data.
Therefore, disable local interrupts in the critical section, to
maximize the probability that we can satisfy this requirement.
Fixes: d55ad2967d ("powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Clang static analysis reports this issue
ocelot_flower.c:563:8: warning: 1st function call argument
is an uninitialized value
!is_zero_ether_addr(match.mask->dst)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The variable match is used before it is set. So move the
block.
Fixes: 75944fda1d ("net: mscc: ocelot: offload ingress skbedit and vlan actions to VCAP IS1")
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On a setup with KSZ9131 and MACB drivers it happens on suspend path, from
time to time, that the PHY interrupt arrives after PHY and MACB were
suspended (PHY via genphy_suspend(), MACB via macb_suspend()). In this
case the phy_read() at the beginning of kszphy_handle_interrupt() will
fail (as MACB driver is suspended at this time) leading to phy_error()
being called and a stack trace being displayed on console. To solve this
.suspend/.resume functions for all KSZ devices implementing
.handle_interrupt were replaced with kszphy_suspend()/kszphy_resume()
which disable/enable interrupt before/after calling
genphy_suspend()/genphy_resume().
The fix has been adapted for all KSZ devices which implements
.handle_interrupt but it has been tested only on KSZ9131.
Fixes: 59ca4e58b9 ("net: phy: micrel: implement generic .handle_interrupt() callback")
Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both versions of the CPSW driver declare a CPSW_HEADROOM_NA macro that
takes NET_IP_ALIGN into account, but fail to use it appropriately when
storing incoming packets in memory. This results in the IPv4 source and
destination addresses to appear misaligned in memory, which causes
aligment faults that need to be fixed up in software.
So let's switch from CPSW_HEADROOM to CPSW_HEADROOM_NA where needed.
This gets rid of any alignment faults on the RX path on a Beaglebone
White.
Fixes: 9ed4050c0d ("net: ethernet: ti: cpsw: add XDP support")
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot detected a NULL pointer dereference of nfc_llcp_sock->dev pointer
(which is a 'struct nfc_dev *') with calls to llcp_sock_sendmsg() after
a failed llcp_sock_bind(). The message being sent is a SOCK_DGRAM.
KASAN report:
BUG: KASAN: null-ptr-deref in nfc_alloc_send_skb+0x2d/0xc0
Read of size 4 at addr 00000000000005c8 by task llcp_sock_nfc_a/899
CPU: 5 PID: 899 Comm: llcp_sock_nfc_a Not tainted 5.16.0-rc6-next-20211224-00001-gc6437fbf18b0 #125
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
? nfc_alloc_send_skb+0x2d/0xc0
__kasan_report.cold+0x117/0x11c
? mark_lock+0x480/0x4f0
? nfc_alloc_send_skb+0x2d/0xc0
kasan_report+0x38/0x50
nfc_alloc_send_skb+0x2d/0xc0
nfc_llcp_send_ui_frame+0x18c/0x2a0
? nfc_llcp_send_i_frame+0x230/0x230
? __local_bh_enable_ip+0x86/0xe0
? llcp_sock_connect+0x470/0x470
? llcp_sock_connect+0x470/0x470
sock_sendmsg+0x8e/0xa0
____sys_sendmsg+0x253/0x3f0
...
The issue was visible only with multiple simultaneous calls to bind() and
sendmsg(), which resulted in most of the bind() calls to fail. The
bind() was failing on checking if there is available WKS/SDP/SAP
(respective bit in 'struct nfc_llcp_local' fields). When there was no
available WKS/SDP/SAP, the bind returned error but the sendmsg() to such
socket was able to trigger mentioned NULL pointer dereference of
nfc_llcp_sock->dev.
The code looks simply racy and currently it protects several paths
against race with checks for (!nfc_llcp_sock->local) which is NULL-ified
in error paths of bind(). The llcp_sock_sendmsg() did not have such
check but called function nfc_llcp_send_ui_frame() had, although not
protected with lock_sock().
Therefore the race could look like (same socket is used all the time):
CPU0 CPU1
==== ====
llcp_sock_bind()
- lock_sock()
- success
- release_sock()
- return 0
llcp_sock_sendmsg()
- lock_sock()
- release_sock()
llcp_sock_bind(), same socket
- lock_sock()
- error
- nfc_llcp_send_ui_frame()
- if (!llcp_sock->local)
- llcp_sock->local = NULL
- nfc_put_device(dev)
- dereference llcp_sock->dev
- release_sock()
- return -ERRNO
The nfc_llcp_send_ui_frame() checked llcp_sock->local outside of the
lock, which is racy and ineffective check. Instead, its caller
llcp_sock_sendmsg(), should perform the check inside lock_sock().
Reported-and-tested-by: syzbot+7f23bcddf626e0593a39@syzkaller.appspotmail.com
Fixes: b874dec21d ("NFC: Implement LLCP connection less Tx path")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock says:
====================
Xilinx axienet fixes
Various fixes for the Xilinx AXI Ethernet driver.
Changed since v2:
-added Reviewed-by tags, added some explanation to commit
messages, no code changes
Changed since v1:
-corrected a Fixes tag to point to mainline commit
-split up reset changes into 3 patches
-added ratelimit on netdev_warn in TX busy case
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With previous changes to make the driver handle the TX ring size more
correctly, the default TX ring size of 64 appears to significantly
bottleneck TX performance to around 600 Mbps on a 1 Gbps link on ZynqMP.
Increasing this to 128 seems to bring performance up to near line rate and
shouldn't cause excess bufferbloat (this driver doesn't yet support modern
byte-based queue management).
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network driver documentation indicates we should be avoiding returning
NETDEV_TX_BUSY from ndo_start_xmit in normal cases, since it requires
the packets to be requeued. Instead the queue should be stopped after
a packet is added to the TX ring when there may not be enough room for an
additional one. Also, when TX ring entries are completed, we should only
wake the queue if we know there is room for another full maximally
fragmented packet.
Print a warning if there is insufficient space at the start of start_xmit,
since this should no longer happen.
Combined with increasing the default TX ring size (in a subsequent
patch), this appears to recover the TX performance lost by previous changes
to actually manage the TX ring state properly.
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for the number of available TX ring slots was off by 1 since a
slot is required for the skb header as well as each fragment. This could
result in overwriting a TX ring slot that was still in use.
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for whether a TX ring slot was available was incorrect,
since a slot which had been loaded with transmit data but the device had
not started transmitting would be treated as available, potentially
causing non-transmitted slots to be overwritten. The control field in
the descriptor should be checked, rather than the status field (which may
only be updated when the device completes the entry).
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver will not work properly if the TX ring size is set to below
MAX_SKB_FRAGS + 1 since it needs to hold at least one full maximally
fragmented packet in the TX ring. Limit setting the ring size to below
this value.
Fixes: 8b09ca823f ("net: axienet: Make RX/TX ring sizes configurable")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This driver was missing some required memory barriers:
Use dma_rmb to ensure we see all updates to the descriptor after we see
that an entry has been completed.
Use wmb and rmb to avoid stale descriptor status between the TX path and
TX complete IRQ path.
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some cases where the Xilinx Ethernet core was used in 1000Base-X or
SGMII modes, which use the internal PCS/PMA PHY, and the MGT
transceiver clock source for the PCS was not running at the time the
FPGA logic was loaded, the core would come up in a state where the
PCS could not be found on the MDIO bus. To fix this, the Ethernet core
(including the PCS) should be reset after enabling the clocks, prior to
attempting to access the PCS using of_mdio_find_device.
Fixes: 1a02556086 (net: axienet: Properly handle PCS/PMA PHY for 1000BaseX mode)
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When resetting the device, wait for the PhyRstCmplt bit to be set
in the interrupt status register before continuing initialization, to
ensure that the core is actually ready. When using an external PHY, this
also ensures we do not start trying to access the PHY while it is still
in reset. The PHY reset is initiated by the core reset which is
triggered just above, but remains asserted for 5ms after the core is
reset according to the documentation.
The MgtRdy bit could also be waited for, but unfortunately when using
7-series devices, the bit does not appear to work as documented (it
seems to behave as some sort of link state indication and not just an
indication the transceiver is ready) so it can't really be relied on for
this purpose.
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous timeout of 1ms was too short to handle some cases where the
core is reset just after the input clocks were started, which will
be introduced in an upcoming patch. Increase the timeout to 50ms. Also
simplify the reset timeout checking to use read_poll_timeout.
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've tried to address some performance issues in
f2fs_checkpoint and direct IO flows. Also, there was a work to enhance
the page cache management used for compression. Other than them, we've
done typical work including sysfs, code clean-ups, tracepoint, sanity
check, in addition to bug fixes on corner cases.
Enhancements:
- use iomap for direct IO
- try to avoid lock contention to improve f2fs_ckpt speed
- avoid unnecessary memory allocation in compression flow
- POSIX_FADV_DONTNEED drops the page cache containing compression
pages
- add some sysfs entries (gc_urgent_high_remaining, pending_discard)
Bug fixes:
- try not to expose unwritten blocks to user by DIO (this was added
to avoid merge conflict; another patch is coming to address other
missing case)
- relax minor error condition for file pinning feature used in
Android OTA
- fix potential deadlock case in compression flow
- should not truncate any block on pinned file
In addition, we've done some code clean-ups and tracepoint/sanity
check improvement"
* tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
f2fs: do not allow partial truncation on pinned file
f2fs: remove redunant invalidate compress pages
f2fs: Simplify bool conversion
f2fs: don't drop compressed page cache in .{invalidate,release}page
f2fs: fix to reserve space for IO align feature
f2fs: fix to check available space of CP area correctly in update_ckpt_flags()
f2fs: support fault injection to f2fs_trylock_op()
f2fs: clean up __find_inline_xattr() with __find_xattr()
f2fs: fix to do sanity check on last xattr entry in __f2fs_setxattr()
f2fs: do not bother checkpoint by f2fs_get_node_info
f2fs: avoid down_write on nat_tree_lock during checkpoint
f2fs: compress: fix potential deadlock of compress file
f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
f2fs: add gc_urgent_high_remaining sysfs node
f2fs: fix to do sanity check in is_alive()
f2fs: fix to avoid panic in is_alive() if metadata is inconsistent
f2fs: fix to do sanity check on inode type during garbage collection
f2fs: avoid duplicate call of mark_inode_dirty
f2fs: show number of pending discard commands
f2fs: support POSIX_FADV_DONTNEED drop compressed page cache
...
Pull tracing fix from Steven Rostedt:
"tracing/scripts: Possible uninitialized variable
The 0day bot discovered a possible uninitialized path in the scripts
that sort the mcount sections at build time. Just needed to initialize
that variable"
* tag 'trace-v5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
script/sorttable: Fix some initialization problems