Daniel Borkmann says:
====================
pull-request: bpf-next 2022-09-05
The following pull-request contains BPF updates for your *net-next* tree.
We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).
There are two small merge conflicts, resolve them as follows:
1) tools/testing/selftests/bpf/DENYLIST.s390x
Commit 27e23836ce ("selftests/bpf: Add lru_bug to s390x deny list") in
bpf tree was needed to get BPF CI green on s390x, but it conflicted with
newly added tests on bpf-next. Resolve by adding both hunks, result:
[...]
lru_bug # prog 'printk': failed to auto-attach: -524
setget_sockopt # attach unexpected error: -524 (trampoline)
cb_refs # expected error message unexpected error: -524 (trampoline)
cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc)
htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline)
[...]
2) net/core/filter.c
Commit 1227c1771d ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
from net tree conflicts with commit 29003875bd ("bpf: Change bpf_setsockopt(SOL_SOCKET)
to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
bpf-next tree, result:
[...]
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
return sk_getsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
}
return sk_setsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval), *optlen);
[...]
The main changes are:
1) Add any-context BPF specific memory allocator which is useful in particular for BPF
tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov.
2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.
4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.
5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
integration with the rstat framework, from Yosry Ahmed.
6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.
7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.
8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
backend, from James Hilliard.
9) Fix verifier helper permissions and reference state management for synchronous
callbacks, from Kumar Kartikeya Dwivedi.
10) Add support for BPF selftest's xskxceiver to also be used against real devices that
support MAC loopback, from Maciej Fijalkowski.
11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.
12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.
13) Various minor misc improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bpf: Remove usage of kmem_cache from bpf_mem_cache.
bpf: Remove prealloc-only restriction for sleepable bpf programs.
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
bpf: Remove tracing program restriction on map types
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
bpf: Add percpu allocation support to bpf_mem_alloc.
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
bpf: Adjust low/high watermarks in bpf_mem_cache
bpf: Optimize call_rcu in non-preallocated hash map.
bpf: Optimize element count in non-preallocated hash map.
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
samples/bpf: Reduce syscall overhead in map_perf_test.
selftests/bpf: Improve test coverage of test_maps
bpf: Convert hash map to bpf_mem_alloc.
bpf: Introduce any context BPF specific memory allocator.
selftest/bpf: Add test for bpf_getsockopt()
bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
...
====================
Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
User space might be creating and destroying a lot of hash maps. Synchronous
rcu_barrier-s in a destruction path of hash map delay freeing of hash buckets
and other map memory and may cause artificial OOM situation under stress.
Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc:
- remove rcu_barrier from hash map, since htab doesn't use call_rcu
directly and there are no callback to wait for.
- bpf_mem_alloc has call_rcu_in_progress flag that indicates pending callbacks.
Use it to avoid barriers in fast path.
- When barriers are needed copy bpf_mem_alloc into temp structure
and wait for rcu barrier-s in the worker to let the rest of
hash map freeing to proceed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
Extend bpf_mem_alloc to cache free list of fixed size per-cpu allocations.
Once such cache is created bpf_mem_cache_alloc() will return per-cpu objects.
bpf_mem_cache_free() will free them back into global per-cpu pool after
observing RCU grace period.
per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.
The free list cache consists of tuples { llist_node, per-cpu pointer }
Unlike alloc_percpu() that returns per-cpu pointer
the bpf_mem_cache_alloc() returns a pointer to per-cpu pointer and
bpf_mem_cache_free() expects to receive it back.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-11-alexei.starovoitov@gmail.com
Tracing BPF programs can attach to kprobe and fentry. Hence they
run in unknown context where calling plain kmalloc() might not be safe.
Front-end kmalloc() with minimal per-cpu cache of free elements.
Refill this cache asynchronously from irq_work.
BPF programs always run with migration disabled.
It's safe to allocate from cache of the current cpu with irqs disabled.
Free-ing is always done into bucket of the current cpu as well.
irq_work trims extra free elements from buckets with kfree
and refills them with kmalloc, so global kmalloc logic takes care
of freeing objects allocated by one cpu and freed on another.
struct bpf_mem_alloc supports two modes:
- When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
This is typical bpf hash map use case when all elements have equal size.
- When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
kmalloc/kfree. Max allocation size is 4096 in this case.
This is bpf_dynptr and bpf_kptr use case.
bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com
Add 1000BASE-KX interface mode. This 1G backplane ethernet as described in
clause 70. Clause 73 autonegotiation is mandatory, and only full duplex
operation is supported.
Although at the PMA level this interface mode is identical to
1000BASE-X, it uses a different form of in-band autonegation. This
justifies a separate interface mode, since the interface mode (along
with the MLO_AN_* autonegotiation mode) sets the type of autonegotiation
which will be used on a link. This results in more than just electrical
differences between the link modes.
With regard to 1000BASE-X, 1000BASE-KX holds a similar position to
SGMII: same signaling, but different autonegotiation. PCS drivers
(which typically handle in-band autonegotiation) may only support
1000BASE-X, and not 1000BASE-KX. Similarly, the phy mode is used to
configure serdes phys with phy_set_mode_ext. Due to the different
electrical standards (SFI or XFI vs Clause 70), they will likely want to
use different configuration. Adding a phy interface mode for
1000BASE-KX helps simplify configuration in these areas.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a function to update a CGR with new parameters. qman_create_cgr
can almost be used for this (with flags=0), but it's not suitable because
it also registers the callback function. The _safe variant was modeled off
of qman_cgr_delete_safe. However, we handle multiple arguments and a return
value.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Altera Triple Speed Ethernet has a SGMII/1000BaseC PCS that can be
integrated in several ways. It can either be part of the TSE MAC's
address space, accessed through 32 bits accesses on the mapped mdio
device 0, or through a dedicated 16 bits register set.
This driver allows using the TSE PCS outside of altera TSE's driver,
since it can be used standalone by other MACs.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
drivers
- rtw89: large update across the map, e.g. coex, pci(e), etc.
- ath9k: uninit memory read fix
- ath10k: small peer map fix and a WCN3990 device fix
- wfx: underflow
stack
- the "change MAC address while IFF_UP" change from James
we discussed
- more MLO work, including a set of fixes for the previous
code, now that we have more code we can exercise it more
- prevent some features with MLO that aren't ready yet
(AP_VLAN and 4-address connections)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an MLD address attribute to BSS entries that the interface
is currently associated with to help userspace figure out what's
going on.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a new extended feature bit signifying that the wireless hardware
supports changing the MAC address while the underlying net_device is
powered. Note that this has a different meaning from
IFF_LIVE_ADDR_CHANGE as additional restrictions might be imposed by
the hardware, such as:
- No connection is active on this interface, carrier is off
- No scan is in progress
- No offchannel operations are in progress
Signed-off-by: James Prestwood <prestwoj@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We now have a cleaner way to keep compatibility with user-space
(a.k.a. not breaking it) when we need to keep in place a one-element
array (for its use in user-space) together with a flexible-array
member (for its use in kernel-space) without making it hard to read
at the source level. This is through the use of the new
__DECLARE_FLEX_ARRAY() helper macro.
The size and memory layout of the structure is preserved after the
changes. See below.
Before changes:
$ pahole -C ip_msfilter net/ipv4/igmp.o
struct ip_msfilter {
union {
struct {
__be32 imsf_multiaddr_aux; /* 0 4 */
__be32 imsf_interface_aux; /* 4 4 */
__u32 imsf_fmode_aux; /* 8 4 */
__u32 imsf_numsrc_aux; /* 12 4 */
__be32 imsf_slist[1]; /* 16 4 */
}; /* 0 20 */
struct {
__be32 imsf_multiaddr; /* 0 4 */
__be32 imsf_interface; /* 4 4 */
__u32 imsf_fmode; /* 8 4 */
__u32 imsf_numsrc; /* 12 4 */
__be32 imsf_slist_flex[0]; /* 16 0 */
}; /* 0 16 */
}; /* 0 20 */
/* size: 20, cachelines: 1, members: 1 */
/* last cacheline: 20 bytes */
};
After changes:
$ pahole -C ip_msfilter net/ipv4/igmp.o
struct ip_msfilter {
__be32 imsf_multiaddr; /* 0 4 */
__be32 imsf_interface; /* 4 4 */
__u32 imsf_fmode; /* 8 4 */
__u32 imsf_numsrc; /* 12 4 */
union {
__be32 imsf_slist[1]; /* 16 4 */
struct {
struct {
} __empty_imsf_slist_flex; /* 16 0 */
__be32 imsf_slist_flex[0]; /* 16 0 */
}; /* 16 0 */
}; /* 16 4 */
/* size: 20, cachelines: 1, members: 5 */
/* last cacheline: 20 bytes */
};
In the past, we had to duplicate the whole original structure within
a union, and update the names of all the members. Now, we just need to
declare the flexible-array member to be used in kernel-space through
the __DECLARE_FLEX_ARRAY() helper together with the one-element array,
within a union. This makes the source code more clean and easier to read.
Link: https://github.com/KSPP/linux/issues/193
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes bpf_getsockopt(SOL_IPV6) to reuse
do_ipv6_getsockopt(). It removes the duplicated code from
bpf_getsockopt(SOL_IPV6).
This also makes bpf_getsockopt(SOL_IPV6) supporting the same
set of optnames as in bpf_setsockopt(SOL_IPV6). In particular,
this adds IPV6_AUTOFLOWLABEL support to bpf_getsockopt(SOL_IPV6).
ipv6 could be compiled as a module. Like how other code solved it
with stubs in ipv6_stubs.h, this patch adds the do_ipv6_getsockopt
to the ipv6_bpf_stub.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002931.2896218-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch changes bpf_getsockopt(SOL_TCP) to reuse
do_tcp_getsockopt(). It removes the duplicated code from
bpf_getsockopt(SOL_TCP).
Before this patch, there were some optnames available to
bpf_setsockopt(SOL_TCP) but missing in bpf_getsockopt(SOL_TCP).
For example, TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL,
and a few more. It surprises users from time to time. This patch
automatically closes this gap without duplicating more code.
bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn,
so it stays in sol_tcp_sockopt().
For string name value like TCP_CONGESTION, bpf expects it
is always null terminated, so sol_tcp_sockopt() decrements
optlen by one before calling do_tcp_getsockopt() and
the 'if (optlen < saved_optlen) memset(..,0,..);'
in __bpf_getsockopt() will always do a null termination.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch changes bpf_getsockopt(SOL_SOCKET) to reuse
sk_getsockopt(). It removes all duplicated code from
bpf_getsockopt(SOL_SOCKET).
Before this patch, there were some optnames available to
bpf_setsockopt(SOL_SOCKET) but missing in bpf_getsockopt(SOL_SOCKET).
It surprises users from time to time. For example, SO_REUSEADDR,
SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE. This patch
automatically closes this gap without duplicating more code.
The only exception is SO_BINDTODEVICE because it needs to acquire a
blocking lock. Thus, SO_BINDTODEVICE is not supported.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similar to the earlier patch that changes sk_getsockopt() to
take the sockptr_t argument . This patch also changes
do_ipv6_getsockopt() to take the sockptr_t argument such that
a latter patch can make bpf_getsockopt(SOL_IPV6) to reuse
do_ipv6_getsockopt().
Note on the change in ip6_mc_msfget(). This function is to
return an array of sockaddr_storage in optval. This function
is shared between ipv6_get_msfilter() and compat_ipv6_get_msfilter().
However, the sockaddr_storage is stored at different offset of the
optval because of the difference between group_filter and
compat_group_filter. Thus, a new 'ss_offset' argument is
added to ip6_mc_msfget().
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002853.2892532-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similar to the earlier patch that changes sk_getsockopt() to
take the sockptr_t argument. This patch also changes
do_ip_getsockopt() to take the sockptr_t argument such that
a latter patch can make bpf_getsockopt(SOL_IP) to reuse
do_ip_getsockopt().
Note on the change in ip_mc_gsfget(). This function is to
return an array of sockaddr_storage in optval. This function
is shared between ip_get_mcast_msfilter() and
compat_ip_get_mcast_msfilter(). However, the sockaddr_storage
is stored at different offset of the optval because of
the difference between group_filter and compat_group_filter.
Thus, a new 'ss_offset' argument is added to ip_mc_gsfget().
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002828.2890585-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch changes sk_getsockopt() to take the sockptr_t argument
such that it can be used by bpf_getsockopt(SOL_SOCKET) in a
latter patch.
security_socket_getpeersec_stream() is not changed. It stays
with the __user ptr (optval.user and optlen.user) to avoid changes
to other security hooks. bpf_getsockopt(SOL_SOCKET) also does not
support SO_PEERSEC.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When CONFIG_IEEE802154_NL802154_EXPERIMENTAL is disabled,
NL802154_CMD_DEL_SEC_LEVEL is undefined and results in a compilation
error:
net/ieee802154/nl802154.c:2503:19: error: 'NL802154_CMD_DEL_SEC_LEVEL' undeclared here (not in a function); did you mean 'NL802154_CMD_SET_CCA_ED_LEVEL'?
2503 | .resv_start_op = NL802154_CMD_DEL_SEC_LEVEL + 1,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
| NL802154_CMD_SET_CCA_ED_LEVEL
Unhide the experimental commands, having them defined in an enum
makes no difference.
Fixes: 9c5d03d362 ("genetlink: start to validate reserved header bytes")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lore.kernel.org/r/20220902030620.2737091-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Existing 'bpf_skb_get_tunnel_key' extracts various tunnel parameters
(id, ttl, tos, local and remote) but does not expose ip_tunnel_info's
tun_flags to the BPF program.
It makes sense to expose tun_flags to the BPF program.
Assume for example multiple GRE tunnels maintained on a single GRE
interface in collect_md mode. The program expects origins to initiate
over GRE, however different origins use different GRE characteristics
(e.g. some prefer to use GRE checksum, some do not; some pass a GRE key,
some do not, etc..).
A BPF program getting tun_flags can therefore remember the relevant
flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In
the reply path, the program can use 'bpf_skb_set_tunnel_key' in order
to correctly reply to the remote, using similar characteristics, based
on the stored tunnel flags.
Introduce BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If
specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags.
Decided to use the existing unused 'tunnel_ext' as the storage for the
'tunnel_flags' in order to avoid changing bpf_tunnel_key's layout.
Also, the following has been considered during the design:
1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy
and place into the new 'tunnel_flags' field. This has 2 drawbacks:
- The BPF_F_yyy flags are from *set_tunnel_key* enumeration space,
e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into
tunnel_flags from a *get_tunnel_key* call.
- Not all "interesting" TUNNEL_xxx flags can be mapped to existing
BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy
flags just for purposes of the returned tunnel_flags.
2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only
"interesting" flags. That's ok, but the drawback is that what's
"interesting" for my usecase might be limiting for other usecases.
Therefore I decided to expose what's in key.tun_flags *as is*, which seems
most flexible. The BPF user can just choose to ignore bits he's not
interested in. The TUNNEL_xxx are also UAPI, so no harm exposing them
back in the get_tunnel_key call.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
Add some documentation for netdev_tx_sent_queue() and
netdev_tx_completed_queue()
Stating that netdev_tx_completed_queue() must be called once
per TX completion round is apparently not obvious for everybody.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth, bpf and wireless.
Current release - regressions:
- bpf:
- fix wrong last sg check in sk_msg_recvmsg()
- fix kernel BUG in purge_effective_progs()
- mac80211:
- fix possible leak in ieee80211_tx_control_port()
- potential NULL dereference in ieee80211_tx_control_port()
Current release - new code bugs:
- nfp: fix the access to management firmware hanging
Previous releases - regressions:
- ip: fix triggering of 'icmp redirect'
- sched: tbf: don't call qdisc_put() while holding tree lock
- bpf: fix corrupted packets for XDP_SHARED_UMEM
- bluetooth: hci_sync: fix suspend performance regression
- micrel: fix probe failure
Previous releases - always broken:
- tcp: make global challenge ack rate limitation per net-ns and
default disabled
- tg3: fix potential hang-up on system reboot
- mac802154: fix reception for no-daddr packets
Misc:
- r8152: add PID for the lenovo onelink+ dock"
* tag 'net-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
net/smc: Remove redundant refcount increase
Revert "sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb"
tcp: make global challenge ack rate limitation per net-ns and default disabled
tcp: annotate data-race around challenge_timestamp
net: dsa: hellcreek: Print warning only once
ip: fix triggering of 'icmp redirect'
sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb
selftests: net: sort .gitignore file
Documentation: networking: correct possessive "its"
kcm: fix strp_init() order and cleanup
mlxbf_gige: compute MDIO period based on i1clk
ethernet: rocker: fix sleep in atomic context bug in neigh_timer_handler
net: lan966x: improve error handle in lan966x_fdma_rx_get_frame()
nfp: fix the access to management firmware hanging
net: phy: micrel: Make the GPIO to be non-exclusive
net: virtio_net: fix notification coalescing comments
net/sched: fix netdevice reference leaks in attach_default_qdiscs()
net: sched: tbf: don't call qdisc_put() while holding tree lock
net: Use u64_stats_fetch_begin_irq() for stats fetch.
net: dsa: xrs700x: Use irqsave variant for u64 stats update
...
Because per host rate limiting has been proven problematic (side channel
attacks can be based on it), per host rate limiting of challenge acks ideally
should be per netns and turned off by default.
This is a long due followup of following commits:
083ae30828 ("tcp: enable per-socket rate limiting of all 'challenge acks'")
f2b2c582e8 ("tcp: mitigate ACK loops for connections as tcp_sock")
75ff39ccc1 ("tcp: make challenge acks less predictable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The variable "other" in the struct red_stats is not used. Remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
anon_vma->degree tracks the combined number of child anon_vmas and VMAs
that use the anon_vma as their ->anon_vma.
anon_vma_clone() then assumes that for any anon_vma attached to
src->anon_vma_chain other than src->anon_vma, it is impossible for it to
be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
elevated by 1 because of a child anon_vma, meaning that if ->degree
equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
This assumption is wrong because the ->degree optimization leads to leaf
nodes being abandoned on anon_vma_clone() - an existing anon_vma is
reused and no new parent-child relationship is created. So it is
possible to reuse an anon_vma for one VMA while it is still tied to
another VMA.
This is an issue because is_mergeable_anon_vma() and its callers assume
that if two VMAs have the same ->anon_vma, the list of anon_vmas
attached to the VMAs is guaranteed to be the same. When this assumption
is violated, vma_merge() can merge pages into a VMA that is not attached
to the corresponding anon_vma, leading to dangling page->mapping
pointers that will be dereferenced during rmap walks.
Fix it by separately tracking the number of child anon_vmas and the
number of VMAs using the anon_vma as their ->anon_vma.
Fixes: 7a3ef208e6 ("mm: prevent endless growth of anon_vma hierarchy")
Cc: stable@kernel.org
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull fscache/cachefiles fixes from David Howells:
- Fix kdoc on fscache_use/unuse_cookie().
- Fix the error returned by cachefiles_ondemand_copen() from an upcall
result.
- Fix the distribution of requests in on-demand mode in cachefiles to
be fairer by cycling through them rather than picking the one with
the lowest ID each time (IDs being reused).
* tag 'fscache-fixes-20220831' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
cachefiles: make on-demand request distribution fairer
cachefiles: fix error return code in cachefiles_ondemand_copen()
fscache: fix misdocumented parameter
Pull LSM support for IORING_OP_URING_CMD from Paul Moore:
"Add SELinux and Smack controls to the io_uring IORING_OP_URING_CMD.
These are necessary as without them the IORING_OP_URING_CMD remains
outside the purview of the LSMs (Luis' LSM patch, Casey's Smack patch,
and my SELinux patch). They have been discussed at length with the
io_uring folks, and Jens has given his thumbs-up on the relevant
patches (see the commit descriptions).
There is one patch that is not strictly necessary, but it makes
testing much easier and is very trivial: the /dev/null
IORING_OP_URING_CMD patch."
* tag 'lsm-pr-20220829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
Smack: Provide read control for io_uring_cmd
/dev/null: add IORING_OP_URING_CMD support
selinux: implement the security_uring_cmd() LSM hook
lsm,io_uring: add LSM hooks for the new uring_cmd file op
Following what we do for routers already, extend this to XDomain
connections as well. This will show in sysfs whether the link is in USB4
or Thunderbolt mode.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow-up the removal of unused internal api of port params made by
commit 42ded61aa7 ("devlink: Delete not used port parameters APIs")
and stub the commands and add extack message to tell the user what is
going on.
If later on port params are needed, could be easily re-introduced,
but until then it is a dead code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220826082730.1399735-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Being able to check attribute presence and set extack
if not on one line is handy, add helpers.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There is currently no way to report via extack in a structured way
that an attribute is missing. This leads to families resorting to
string messages.
Add a pair of attributes - @offset and @type for machine-readable
way of reporting missing attributes. The @offset points to the
nest which should have contained the attribute, @type is the
expected nla_type. The offset will be skipped if the attribute
is missing at the message level rather than inside a nest.
User space should be able to figure out which attribute enum
(AKA attribute space AKA attribute set) the nest pointed to by
@offset is using.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
IP multicast must sometimes be discriminated from non-IP multicast,
e.g. when determining the forwarding behavior of a given group in the
presence of multicast router ports on an offloaded bridge. Therefore,
provide helpers to identify these groups.
Signed-off-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We had historically not checked that genlmsghdr.reserved
is 0 on input which prevents us from using those precious
bytes in the future.
One use case would be to extend the cmd field, which is
currently just 8 bits wide and 256 is not a lot of commands
for some core families.
To make sure that new families do the right thing by default
put the onus of opting out of validation on existing families.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel)
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more hotfixes from Andrew Morton:
"Seventeen hotfixes. Mostly memory management things.
Ten patches are cc:stable, addressing pre-6.0 issues"
* tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
.mailmap: update Luca Ceresoli's e-mail address
mm/mprotect: only reference swap pfn page if type match
squashfs: don't call kmalloc in decompressors
mm/damon/dbgfs: avoid duplicate context directory creation
mailmap: update email address for Colin King
asm-generic: sections: refactor memory_intersects
bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
Revert "memcg: cleanup racy sum avoidance code"
mm/zsmalloc: do not attempt to free IS_ERR handle
binder_alloc: add missing mmap_lock calls when using the VMA
mm: re-allow pinning of zero pfns (again)
vmcoreinfo: add kallsyms_num_syms symbol
mailmap: update Guilherme G. Piccoli's email addresses
writeback: avoid use-after-free after removing device
shmem: update folio if shmem_replace_page() updates the page
mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
There are two problems with the current code of memory_intersects:
First, it doesn't check whether the region (begin, end) falls inside the
region (virt, vend), that is (virt < begin && vend > end).
The second problem is if vend is equal to begin, it will return true but
this is wrong since vend (virt + size) is not the last address of the
memory region but (virt + size -1) is. The wrong determination will
trigger the misreporting when the function check_for_illegal_area calls
memory_intersects to check if the dma region intersects with stext region.
The misreporting is as below (stext is at 0x80100000):
WARNING: CPU: 0 PID: 77 at kernel/dma/debug.c:1073 check_for_illegal_area+0x130/0x168
DMA-API: chipidea-usb2 e0002000.usb: device driver maps memory from kernel text or rodata [addr=800f0000] [len=65536]
Modules linked in:
CPU: 1 PID: 77 Comm: usb-storage Not tainted 5.19.0-yocto-standard #5
Hardware name: Xilinx Zynq Platform
unwind_backtrace from show_stack+0x18/0x1c
show_stack from dump_stack_lvl+0x58/0x70
dump_stack_lvl from __warn+0xb0/0x198
__warn from warn_slowpath_fmt+0x80/0xb4
warn_slowpath_fmt from check_for_illegal_area+0x130/0x168
check_for_illegal_area from debug_dma_map_sg+0x94/0x368
debug_dma_map_sg from __dma_map_sg_attrs+0x114/0x128
__dma_map_sg_attrs from dma_map_sg_attrs+0x18/0x24
dma_map_sg_attrs from usb_hcd_map_urb_for_dma+0x250/0x3b4
usb_hcd_map_urb_for_dma from usb_hcd_submit_urb+0x194/0x214
usb_hcd_submit_urb from usb_sg_wait+0xa4/0x118
usb_sg_wait from usb_stor_bulk_transfer_sglist+0xa0/0xec
usb_stor_bulk_transfer_sglist from usb_stor_bulk_srb+0x38/0x70
usb_stor_bulk_srb from usb_stor_Bulk_transport+0x150/0x360
usb_stor_Bulk_transport from usb_stor_invoke_transport+0x38/0x440
usb_stor_invoke_transport from usb_stor_control_thread+0x1e0/0x238
usb_stor_control_thread from kthread+0xf8/0x104
kthread from ret_from_fork+0x14/0x2c
Refactor memory_intersects to fix the two problems above.
Before the 1d7db834a0 ("dma-debug: use memory_intersects()
directly"), memory_intersects is called only by printk_late_init:
printk_late_init -> init_section_intersects ->memory_intersects.
There were few places where memory_intersects was called.
When commit 1d7db834a0 ("dma-debug: use memory_intersects()
directly") was merged and CONFIG_DMA_API_DEBUG is enabled, the DMA
subsystem uses it to check for an illegal area and the calltrace above
is triggered.
[akpm@linux-foundation.org: fix nearby comment typo]
Link: https://lkml.kernel.org/r/20220819081145.948016-1-quanyang.wang@windriver.com
Fixes: 9795593625 ("asm/sections: add helpers to check for section data")
Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thierry Reding <treding@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 96e51ccf1a.
Recently we started running the kernel with rstat infrastructure on
production traffic and begin to see negative memcg stats values.
Particularly the 'sock' stat is the one which we observed having negative
value.
$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 18446744073708724224
Re-run after couple of seconds
$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 53248
For now we are only seeing this issue on large machines (256 CPUs) and
only with 'sock' stat. I think the networking stack increase the stat on
one cpu and decrease it on another cpu much more often. So, this negative
sock is due to rstat flusher flushing the stats on the CPU that has seen
the decrement of sock but missed the CPU that has increments. A typical
race condition.
For easy stable backport, revert is the most simple solution. For long
term solution, I am thinking of two directions. First is just reduce the
race window by optimizing the rstat flusher. Second is if the reader sees
a negative stat value, force flush and restart the stat collection.
Basically retry but limited.
Link: https://lkml.kernel.org/r/20220817172139.3141101-1-shakeelb@google.com
Fixes: 96e51ccf1a ("memcg: cleanup racy sum avoidance code")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: <stable@vger.kernel.org> [5.15]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
CRIU is preserving ifindexes of net devices after restoration. However,
current Open vSwitch API does not allow to target ifindex, so we cannot
correctly restore OVS configuration.
Add new OVS_DP_ATTR_IFINDEX for OVS_DP_CMD_NEW and use it as desired
ifindex.
Use OVS_VPORT_ATTR_IFINDEX during OVS_VPORT_CMD_NEW to specify new netdev
ifindex.
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull io_uring fixes from Jens Axboe:
- Add missing header file to the MAINTAINERS entry for io_uring (Ammar)
- liburing and the kernel ship the same io_uring.h header, but one
change we've had for a long time only in liburing is to ensure it's
C++ safe. Add extern C around it, so we can more easily sync them in
the future (Ammar)
- Fix an off-by-one in the sync cancel added in this merge window (me)
- Error handling fix for passthrough (Kanchan)
- Fix for address saving for async execution for the zc tx support
(Pavel)
- Fix ordering for TCP zc notifications, so we always have them ordered
correctly between "data was sent" and "data was acked". This isn't
strictly needed with the notification slots, but we've been pondering
disabling the slot support for 6.0 - and if we do, then we do require
the ordering to be sane. Regardless of that, it's the sane thing to
do in terms of API (Pavel)
- Minor cleanup for indentation and lockdep annotation (Pavel)
* tag 'io_uring-6.0-2022-08-26' of git://git.kernel.dk/linux-block:
io_uring/net: save address for sendzc async execution
io_uring: conditional ->async_data allocation
io_uring/notif: order notif vs send CQEs
io_uring/net: fix indentation
io_uring/net: fix zc send link failing
io_uring/net: fix must_hold annotation
io_uring: fix submission-failure handling for uring-cmd
io_uring: fix off-by-one in sync cancelation file check
io_uring: uapi: Add `extern "C"` in io_uring.h for liburing
MAINTAINERS: Add `include/linux/io_uring_types.h`
Pull SCSI fixes from James Bottomley:
"Ten fixes.
Of the three core changes, the two large ones are a complete reversion
of the async rework and an ALUA timing rework (the latter shouldn't
affect non-ALUA paths).
The remaining patches are all small and all but one in drivers"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: Revert "Rework asynchronous resume support"
scsi: core: Fix passthrough retry counter handling
scsi: ufs: core: Reduce the power mode change timeout
scsi: storvsc: Remove WQ_MEM_RECLAIM from storvsc_error_wq
scsi: ufs: host: ufs-exynos: Make fsd_ufs_drvs static
scsi: megaraid_sas: Remove unnecessary kfree()
scsi: megaraid_sas: Fix double kfree()
scsi: ufs: core: Enable link lost interrupt
scsi: core: Allow the ALUA transitioning state enough time
scsi: qla2xxx: Disable ATIO interrupt coalesce for quad port ISP27XX
There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read).
On architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and
they may return invalid data.
Fix this class of bugs by introducing a new function "test_bit_acquire"
that works like test_bit, but has acquire memory ordering semantics.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
io-uring cmd support was added through ee692a21e9 ("fs,io_uring:
add infrastructure for uring-cmd"), this extended the struct
file_operations to allow a new command which each subsystem can use
to enable command passthrough. Add an LSM specific for the command
passthrough which enables LSMs to inspect the command details.
This was discussed long ago without no clear pointer for something
conclusive, so this enables LSMs to at least reject this new file
operation.
[0] https://lkml.kernel.org/r/8adf55db-7bab-f59d-d612-ed906b948d19@schaufler-ca.com
Cc: stable@vger.kernel.org
Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Daniel borkmann says:
====================
The following pull-request contains BPF updates for your *net* tree.
We've added 11 non-merge commits during the last 14 day(s) which contain
a total of 13 files changed, 61 insertions(+), 24 deletions(-).
The main changes are:
1) Fix BPF verifier's precision tracking around BPF ring buffer, from Kumar Kartikeya Dwivedi.
2) Fix regression in tunnel key infra when passing FLOWI_FLAG_ANYSRC, from Eyal Birger.
3) Fix insufficient permissions for bpf_sys_bpf() helper, from YiFei Zhu.
4) Fix splat from hitting BUG when purging effective cgroup programs, from Pu Lehui.
5) Fix range tracking for array poke descriptors, from Daniel Borkmann.
6) Fix corrupted packets for XDP_SHARED_UMEM in aligned mode, from Magnus Karlsson.
7) Fix NULL pointer splat in BPF sockmap sk_msg_recvmsg(), from Liu Jian.
8) Add READ_ONCE() to bpf_jit_limit when reading from sysctl, from Kuniyuki Iwashima.
9) Add BPF selftest lru_bug check to s390x deny list, from Daniel Müller.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>