Commit Graph

75240 Commits

Author SHA1 Message Date
Willem de Bruijn
1b4fa28a8b net: parametrize skb_segment unit test to expand coverage
Expand the test with variants

- GSO_TEST_NO_GSO:      payload size less than or equal to gso_size
- GSO_TEST_FRAGS:       payload in both linear and page frags
- GSO_TEST_FRAGS_PURE:  payload exclusively in page frags
- GSO_TEST_GSO_PARTIAL: produce one gso segment of multiple of gso_size,
                        plus optionally one non-gso trailer segment

Define a test struct that encodes the input gso skb and output segs.
Input in terms of linear and fragment lengths. Output as length of
each segment.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-11 10:39:01 +01:00
Willem de Bruijn
b3098d32ed net: add skb_segment kunit test
Add unit testing for skb segment. This function is exercised by many
different code paths, such as GSO_PARTIAL or GSO_BY_FRAGS, linear
(with or without head_frag), frags or frag_list skbs, etc.

It is infeasible to manually run tests that cover all code paths when
making changes. The long and complex function also makes it hard to
establish through analysis alone that a patch has no unintended
side-effects.

Add code coverage through kunit regression testing. Introduce kunit
infrastructure for tests under net/core, and add this first test.

This first skb_segment test exercises a simple case: a linear skb.
Follow-on patches will parametrize the test and add more variants.

Tested: Built and ran the test with

    make ARCH=um mrproper

    ./tools/testing/kunit/kunit.py run \
        --kconfig_add CONFIG_NET=y \
        --kconfig_add CONFIG_DEBUG_KERNEL=y \
        --kconfig_add CONFIG_DEBUG_INFO=y \
        --kconfig_add=CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y \
        net_core_gso

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-11 10:39:01 +01:00
Nils Hoppmann
a950a5921d net/smc: Fix pos miscalculation in statistics
SMC_STAT_PAYLOAD_SUB(_smc_stats, _tech, key, _len, _rc) will calculate
wrong bucket positions for payloads of exactly 4096 bytes and
(1 << (m + 12)) bytes, with m == SMC_BUF_MAX - 1.

Intended bucket distribution:
Assume l == size of payload, m == SMC_BUF_MAX - 1.

Bucket 0                : 0 < l <= 2^13
Bucket n, 1 <= n <= m-1 : 2^(n+12) < l <= 2^(n+13)
Bucket m                : l > 2^(m+12)

Current solution:
_pos = fls64((l) >> 13)
[...]
_pos = (_pos < m) ? ((l == 1 << (_pos + 12)) ? _pos - 1 : _pos) : m

For l == 4096, _pos == -1, but should be _pos == 0.
For l == (1 << (m + 12)), _pos == m, but should be _pos == m - 1.

In order to avoid special treatment of these corner cases, the
calculation is adjusted. The new solution first subtracts the length by
one, and then calculates the correct bucket by shifting accordingly,
i.e. _pos = fls64((l - 1) >> 13), l > 0.
This not only fixes the issues named above, but also makes the whole
bucket assignment easier to follow.

Same is done for SMC_STAT_RMB_SIZE_SUB(_smc_stats, _tech, k, _len),
where the calculation of the bucket position is similar to the one
named above.

Fixes: e0e4b8fa53 ("net/smc: Add SMC statistics support")
Suggested-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-11 10:36:35 +01:00
Yajun Deng
5247dbf16c net/core: Introduce netdev_core_stats_inc()
Although there is a kfree_skb_reason() helper function that can be used to
find the reason why this skb is dropped, but most callers didn't increase
one of rx_dropped, tx_dropped, rx_nohandler and rx_otherhost_dropped.

For the users, people are more concerned about why the dropped in ip
is increasing.

Introduce netdev_core_stats_inc() for trace the caller of
dev_core_stats_*_inc().

Also, add __code to netdev_core_stats_alloc(), as it's called with small
probability. And add noinline make sure netdev_core_stats_inc was never
inlined.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-11 10:07:40 +01:00
Russell King (Oracle)
63b9f7a19f net: dsa: remove dsa_port_phylink_validate()
As all drivers now provide phylink capabilities (including MAC), the
if() condition in dsa_port_phylink_validate() will always be true. We
will always use the generic validator, which phylink will call itself
if the .validate method isn't populated. Thus, there is now no need to
implement the .validate method, so this implementation can be removed.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-11 10:06:05 +01:00
Jakub Kicinski
ad98426a88 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2023-10-11

We've added 14 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 398 insertions(+), 104 deletions(-).

The main changes are:

1) Fix s390 JIT backchain issues in the trampoline code generation which
   previously clobbered the caller's backchain, from Ilya Leoshkevich.

2) Fix zero-size allocation warning in xsk sockets when the configured
   ring size was close to SIZE_MAX, from Andrew Kanner.

3) Fixes for bpf_mprog API that were found when implementing support
   in the ebpf-go library along with selftests, from Daniel Borkmann
   and Lorenz Bauer.

4) Fix riscv JIT to properly sign-extend the return register in programs.
   This fixes various test_progs selftests on riscv, from Björn Töpel.

5) Fix verifier log for async callback return values where the allowed
   range was displayed incorrectly, from David Vernet.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  s390/bpf: Fix unwinding past the trampoline
  s390/bpf: Fix clobbering the caller's backchain in the trampoline
  selftests/bpf: Add testcase for async callback return value failure
  bpf: Fix verifier log for async callback return values
  xdp: Fix zero-size allocation warning in xskq_create()
  riscv, bpf: Track both a0 (RISC-V ABI) and a5 (BPF) return values
  riscv, bpf: Sign-extend return values
  selftests/bpf: Make seen_tc* variable tests more robust
  selftests/bpf: Test query on empty mprog and pass revision into attach
  selftests/bpf: Adapt assert_mprog_count to always expect 0 count
  selftests/bpf: Test bpf_mprog query API via libbpf and raw syscall
  bpf: Refuse unused attributes in bpf_prog_{attach,detach}
  bpf: Handle bpf_mprog_query with NULL entry
  bpf: Fix BPF_PROG_QUERY last field check
====================

Link: https://lore.kernel.org/r/20231010223610.3984-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:59:49 -07:00
Kory Maincent
108a36d07c ethtool: Fix mod state of verbose no_mask bitset
A bitset without mask in a _SET request means we want exactly the bits in
the bitset to be set. This works correctly for compact format but when
verbose format is parsed, ethnl_update_bitset32_verbose() only sets the
bits present in the request bitset but does not clear the rest. The commit
6699170376 fixes this issue by clearing the whole target bitmap before we
start iterating. The solution proposed brought an issue with the behavior
of the mod variable. As the bitset is always cleared the old val will
always differ to the new val.

Fix it by adding a new temporary variable which save the state of the old
bitmap.

Fixes: 6699170376 ("ethtool: fix application of verbose no_mask bitset")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231009133645.44503-1-kory.maincent@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:48:15 -07:00
Jakub Kicinski
b52acd02c1 Merge tag 'linux-can-fixes-for-6.6-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:

====================
pull-request: can 2023-10-09

Lukas Magel's patch for the CAN ISO-TP protocol fixes the TX state
detection and wait behavior.

John Watts contributes a patch to only show the sun4i_can Kconfig
option on ARCH_SUNXI.

A patch by Miquel Raynal fixes the soft-reset workaround for Renesas
SoCs in the sja1000 driver.

Markus Schneider-Pargmann's patch for the tcan4x5x m_can glue driver
fixes the id2 register for the tcan4553.

2 patches by Haibo Chen fix the flexcan stop mode for the imx93 SoC.

* tag 'linux-can-fixes-for-6.6-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: tcan4x5x: Fix id2_register for tcan4553
  can: flexcan: remove the auto stop mode for IMX93
  can: sja1000: Always restart the Tx queue after an overrun
  arm64: dts: imx93: add the Flex-CAN stop mode by GPR
  can: sun4i_can: Only show Kconfig if ARCH_SUNXI is set
  can: isotp: isotp_sendmsg(): fix TX state detection and wait behavior
====================

Link: https://lore.kernel.org/r/20231009085256.693378-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:46:00 -07:00
Eric Dumazet
31c07dffaf net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
Sili Luo reported a race in nfc_llcp_sock_get(), leading to UAF.

Getting a reference on the socket found in a lookup while
holding a lock should happen before releasing the lock.

nfc_llcp_sock_get_sn() has a similar problem.

Finally nfc_llcp_recv_snl() needs to make sure the socket
found by nfc_llcp_sock_from_sn() does not disappear.

Fixes: 8f50020ed9 ("NFC: LLCP late binding")
Reported-by: Sili Luo <rootlab@huawei.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20231009123110.3735515-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:44:44 -07:00
Jeremy Kerr
5093bbfc10 mctp: perform route lookups under a RCU read-side lock
Our current route lookups (mctp_route_lookup and mctp_route_lookup_null)
traverse the net's route list without the RCU read lock held. This means
the route lookup is subject to preemption, resulting in an potential
grace period expiry, and so an eventual kfree() while we still have the
route pointer.

Add the proper read-side critical section locks around the route
lookups, preventing premption and a possible parallel kfree.

The remaining net->mctp.routes accesses are already under a
rcu_read_lock, or protected by the RTNL for updates.

Based on an analysis from Sili Luo <rootlab@huawei.com>, where
introducing a delay in the route lookup could cause a UAF on
simultaneous sendmsg() and route deletion.

Reported-by: Sili Luo <rootlab@huawei.com>
Fixes: 889b7da23a ("mctp: Add initial routing framework")
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/29c4b0e67dc1bf3571df3982de87df90cae9b631.1696837310.git.jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 19:43:22 -07:00
Arnd Bergmann
1dab47139e appletalk: remove ipddp driver
After the cops driver is removed, ipddp is now the only
CONFIG_DEV_APPLETALK but as far as I can tell, this also has no users
and can be removed, making appletalk support purely based on ethertalk,
using ethernet hardware.

Link: https://lore.kernel.org/netdev/e490dd0c-a65d-4acf-89c6-c06cb48ec880@app.fastmail.com/
Link: https://lore.kernel.org/netdev/9cac4fbd-9557-b0b8-54fa-93f0290a6fb8@schmorgal.com/
Cc: Doug Brown <doug@schmorgal.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20231009141139.1766345-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-10 17:49:00 -07:00
Florian Westphal
6ac9c51eeb netfilter: conntrack: prefer tcp_error_log to pr_debug
pr_debug doesn't provide any information other than that a packet
did not match existing state but also was found to not create a new
connection.

Replaces this with tcp_error_log, which will also dump packets'
content so one can see if this is a stray FIN or RST.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-10 16:34:28 +02:00
Florian Westphal
8a23f4ab92 netfilter: conntrack: simplify nf_conntrack_alter_reply
nf_conntrack_alter_reply doesn't do helper reassignment anymore.
Remove the comments that make this claim.

Furthermore, remove dead code from the function and place ot
in nf_conntrack.h.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-10 16:34:28 +02:00
Phil Sutter
99ab9f84b8 netfilter: nf_tables: Don't allocate nft_rule_dump_ctx
Since struct netlink_callback::args is not used by rule dumpers anymore,
use it to hold nft_rule_dump_ctx. Add a build-time check to make sure it
won't ever exceed the available space.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-10 16:34:28 +02:00
Phil Sutter
8194d599bc netfilter: nf_tables: Carry s_idx in nft_rule_dump_ctx
In order to move the context into struct netlink_callback's scratch
area, the latter must be unused first.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-10 16:34:28 +02:00
Phil Sutter
405c8fd62d netfilter: nf_tables: Carry reset flag in nft_rule_dump_ctx
This relieves the dump callback from having to check nlmsg_type upon
each call and instead performs the check once in .start callback.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-10 16:34:27 +02:00
Phil Sutter
30fa41a0f6 netfilter: nf_tables: Drop pointless memset when dumping rules
None of the dump callbacks uses netlink_callback::args beyond the first
element, no need to zero the data.

Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-10 16:34:20 +02:00
Phil Sutter
afed2b54c5 netfilter: nf_tables: Always allocate nft_rule_dump_ctx
It will move into struct netlink_callback's scratch area later, just put
nf_tables_dump_rules_start in shape to reduce churn later.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-10 16:01:42 +02:00
Wedson Almeida Filho
295d3c4412 net: move sockfs_xattr_handlers to .rodata
This makes it harder for accidental or malicious changes to
sockfs_xattr_handlers at runtime.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-30-wedsonaf@gmail.com
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-10 13:49:21 +02:00
Gerd Bayer
a72178cfe8 net/smc: Fix dependency of SMC on ISM
When the SMC protocol is built into the kernel proper while ISM is
configured to be built as module, linking the kernel fails due to
unresolved dependencies out of net/smc/smc_ism.o to
ism_get_smcd_ops, ism_register_client, and ism_unregister_client
as reported via the linux-next test automation (see link).
This however is a bug introduced a while ago.

Correct the dependency list in ISM's and SMC's Kconfig to reflect the
dependencies that are actually inverted. With this you cannot build a
kernel with CONFIG_SMC=y and CONFIG_ISM=m. Either ISM needs to be 'y',
too - or a 'n'. That way, SMC can still be configured on non-s390
architectures that do not have (nor need) an ISM driver.

Fixes: 89e7d2ba61 ("net/ism: Add new API for client registration")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/linux-next/d53b5b50-d894-4df8-8969-fd39e63440ae@infradead.org/
Co-developed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20231006125847.1517840-1-gbayer@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-10 11:51:41 +02:00
David Morley
939463016b tcp: change data receiver flowlabel after one dup
This commit changes the data receiver repath behavior to occur after
receiving a single duplicate. This can help recover ACK connectivity
quicker if a TLP was sent along a nonworking path.

For instance, consider the case where we have an initially nonworking
forward path and reverse path and subsequently switch to only working
forward paths. Before this patch we would have the following behavior.

+---------+--------+--------+----------+----------+----------+
| Event   | For FL | Rev FL | FP Works | RP Works | Data Del |
+---------+--------+--------+----------+----------+----------+
| Initial | A      | 1      | N        | N        | 0        |
+---------+--------+--------+----------+----------+----------+
| TLP     | A      | 1      | N        | N        | 0        |
+---------+--------+--------+----------+----------+----------+
| RTO 1   | B      | 1      | Y        | N        | 1        |
+---------+--------+--------+----------+----------+----------+
| RTO 2   | C      | 1      | Y        | N        | 2        |
+---------+--------+--------+----------+----------+----------+
| RTO 3   | D      | 2      | Y        | Y        | 3        |
+---------+--------+--------+----------+----------+----------+

This patch gets rid of at least RTO 3, avoiding additional unnecessary
repaths of a working forward path to a (potentially) nonworking one.

In addition, this commit changes the behavior to avoid repathing upon
rx of duplicate data if the local endpoint is in CA_Loss (in which
case the RTOs will already be changing the outgoing flowlabel).

Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-10 10:02:59 +02:00
David Morley
95b9a87c6a tcp: record last received ipv6 flowlabel
In order to better estimate whether a data packet has been
retransmitted or is the result of a TLP, we save the last received
ipv6 flowlabel.

To make space for this field we resize the "ato" field in
inet_connection_sock as the current value of TCP_DELACK_MAX can be
fully contained in 8 bits and add a compile_time_assert ensuring this
field is the required size.

v2: addressed kernel bot feedback about dccp_delack_timer()
v3: addressed build error introduced by commit bbf80d713f ("tcp:
derive delack_max from rto_min")

Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-10 10:02:59 +02:00
Ma Ke
513f61e219 net: ipv4: fix return value check in esp_remove_trailer
In esp_remove_trailer(), to avoid an unexpected result returned by
pskb_trim, we should check the return value of pskb_trim().

Signed-off-by: Ma Ke <make_ruc2021@163.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-10 09:55:35 +02:00
Ma Ke
dad4e491e3 net: ipv6: fix return value check in esp_remove_trailer
In esp_remove_trailer(), to avoid an unexpected result returned by
pskb_trim, we should check the return value of pskb_trim().

Signed-off-by: Ma Ke <make_ruc2021@163.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-10 09:55:35 +02:00
Eric Dumazet
26c29961b1 net: refine debug info in skb_checksum_help()
syzbot uses panic_on_warn.

This means that the skb_dump() I added in the blamed commit are
not even called.

Rewrite this so that we get the needed skb dump before syzbot crashes.

Fixes: eeee4b77dc ("net: add more debug info in skb_checksum_help()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231006173355.2254983-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-09 19:46:41 -07:00
Martynas Pumputis
dab4e1f06c bpf: Derive source IP addr via bpf_*_fib_lookup()
Extend the bpf_fib_lookup() helper by making it to return the source
IPv4/IPv6 address if the BPF_FIB_LOOKUP_SRC flag is set.

For example, the following snippet can be used to derive the desired
source IP address:

    struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr };

    ret = bpf_skb_fib_lookup(skb, p, sizeof(p),
            BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH);
    if (ret != BPF_FIB_LKUP_RET_SUCCESS)
        return TC_ACT_SHOT;

    /* the p.ipv4_src now contains the source address */

The inability to derive the proper source address may cause malfunctions
in BPF-based dataplanes for hosts containing netdevs with more than one
routable IP address or for multi-homed hosts.

For example, Cilium implements packet masquerading in BPF. If an
egressing netdev to which the Cilium's BPF prog is attached has
multiple IP addresses, then only one [hardcoded] IP address can be used for
masquerading. This breaks connectivity if any other IP address should have
been selected instead, for example, when a public and private addresses
are attached to the same egress interface.

The change was tested with Cilium [1].

Nikolay Aleksandrov helped to figure out the IPv6 addr selection.

[1]: https://github.com/cilium/cilium/pull/28283

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Link: https://lore.kernel.org/r/20231007081415.33502-2-m@lambda.lt
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-09 16:28:35 -07:00
Andrew Kanner
a12bbb3ccc xdp: Fix zero-size allocation warning in xskq_create()
Syzkaller reported the following issue:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 2807 at mm/vmalloc.c:3247 __vmalloc_node_range (mm/vmalloc.c:3361)
  Modules linked in:
  CPU: 0 PID: 2807 Comm: repro Not tainted 6.6.0-rc2+ #12
  Hardware name: Generic DT based system
  unwind_backtrace from show_stack (arch/arm/kernel/traps.c:258)
  show_stack from dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
  dump_stack_lvl from __warn (kernel/panic.c:633 kernel/panic.c:680)
  __warn from warn_slowpath_fmt (./include/linux/context_tracking.h:153 kernel/panic.c:700)
  warn_slowpath_fmt from __vmalloc_node_range (mm/vmalloc.c:3361 (discriminator 3))
  __vmalloc_node_range from vmalloc_user (mm/vmalloc.c:3478)
  vmalloc_user from xskq_create (net/xdp/xsk_queue.c:40)
  xskq_create from xsk_setsockopt (net/xdp/xsk.c:953 net/xdp/xsk.c:1286)
  xsk_setsockopt from __sys_setsockopt (net/socket.c:2308)
  __sys_setsockopt from ret_fast_syscall (arch/arm/kernel/entry-common.S:68)

xskq_get_ring_size() uses struct_size() macro to safely calculate the
size of struct xsk_queue and q->nentries of desc members. But the
syzkaller repro was able to set q->nentries with the value initially
taken from copy_from_sockptr() high enough to return SIZE_MAX by
struct_size(). The next PAGE_ALIGN(size) is such case will overflow
the size_t value and set it to 0. This will trigger WARN_ON_ONCE in
vmalloc_user() -> __vmalloc_node_range().

The issue is reproducible on 32-bit arm kernel.

Fixes: 9f78bf330a ("xsk: support use vaddr as ring")
Reported-by: syzbot+fae676d3cf469331fc89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000c84b4705fb31741e@google.com/T/
Reported-by: syzbot+b132693e925cbbd89e26@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000e20df20606ebab4f@google.com/T/
Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+fae676d3cf469331fc89@syzkaller.appspotmail.com
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://syzkaller.appspot.com/bug?extid=fae676d3cf469331fc89
Link: https://lore.kernel.org/bpf/20231007075148.1759-1-andrew.kanner@gmail.com
2023-10-09 16:13:29 +02:00
Jordan Rife
7563cf17dc libceph: use kernel_connect()
Direct calls to ops->connect() can overwrite the address parameter when
used in conjunction with BPF SOCK_ADDR hooks. Recent changes to
kernel_connect() ensure that callers are insulated from such side
effects. This patch wraps the direct call to ops->connect() with
kernel_connect() to prevent unexpected changes to the address passed to
ceph_tcp_connect().

This change was originally part of a larger patch targeting the net tree
addressing all instances of unprotected calls to ops->connect()
throughout the kernel, but this change was split up into several patches
targeting various trees.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/20230821100007.559638-1-jrife@google.com/
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Fixes: d74bad4e74 ("bpf: Hooks for sys_connect")
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-10-09 13:35:24 +02:00
David Howells
b5f0e20f44 iov_iter, net: Move hash_and_copy_to_iter() to net/
Move hash_and_copy_to_iter() to be with its only caller in networking code.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-13-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-09 09:35:14 +02:00
David Howells
7c6f353e8a iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
Move csum_and_copy_from_iter_full() out of line and then merge
csum_and_copy_from_iter() into its only caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-12-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-09 09:35:14 +02:00
David Howells
dc32bff195 iov_iter, net: Fold in csum_and_memcpy()
Fold csum_and_memcpy() in to its callers.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-11-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-09 09:35:14 +02:00
David Howells
6d0d419914 iov_iter, net: Move csum_and_copy_to/from_iter() to net/
Move csum_and_copy_to/from_iter() to net code now that the iteration
framework can be #included.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-10-dhowells@redhat.com
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-09 09:35:14 +02:00
Eric Dumazet
48533eca60 net: sock_dequeue_err_skb() optimization
Exit early if the list is empty.

Some applications using TCP zerocopy are calling
recvmsg( ... MSG_ERRQUEUE) and hit this case quite often,
probably because busy polling only deals with sk_receive_queue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231005114504.642589-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-06 16:29:08 -07:00
Jakub Kicinski
a1fb841f9d Merge tag 'wireless-next-2023-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:

====================
wireless-next patches for v6.7

The first pull request for v6.7, with both stack and driver changes.
We have a big change how locking is handled in cfg80211 and mac80211
which removes several locks and hopefully simplifies the locking
overall. In drivers rtw89 got MCC support and smaller features to
other active drivers but nothing out of ordinary.

Major changes:

cfg80211
 - remove wdev mutex, use the wiphy mutex instead
 - annotate iftype_data pointer with sparse
 - first kunit tests, for element defrag
 - remove unused scan_width support

mac80211
 - major locking rework, remove several locks like sta_mtx, key_mtx
   etc. and use the wiphy mutex instead
 - remove unused shifted rate support
 - support antenna control in frame injection (requires driver support)
 - convert RX_DROP_UNUSABLE to more detailed reason codes

rtw89
 - TDMA-based multi-channel concurrency (MCC) support

iwlwifi
 - support set_antenna() operation
 - support frame injection antenna control

ath12k
 - WCN7850: enable 320 MHz channels in 6 GHz band
 - WCN7850: hardware rfkill support
 - WCN7850: enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster

ath11k
 - add chip id board name while searching board-2.bin

* tag 'wireless-next-2023-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (272 commits)
  wifi: rtlwifi: remove unreachable code in rtl92d_dm_check_edca_turbo()
  wifi: rtw89: debug: txpwr table supports Wi-Fi 7 chips
  wifi: rtw89: debug: show txpwr table according to chip gen
  wifi: rtw89: phy: set TX power RU limit according to chip gen
  wifi: rtw89: phy: set TX power limit according to chip gen
  wifi: rtw89: phy: set TX power offset according to chip gen
  wifi: rtw89: phy: set TX power by rate according to chip gen
  wifi: rtw89: mac: get TX power control register according to chip gen
  wifi: rtlwifi: use unsigned long for rtl_bssid_entry timestamp
  wifi: rtlwifi: fix EDCA limit set by BT coexistence
  wifi: rt2x00: fix MT7620 low RSSI issue
  wifi: rtw89: refine bandwidth 160MHz uplink OFDMA performance
  wifi: rtw89: refine uplink trigger based control mechanism
  wifi: rtw89: 8851b: update TX power tables to R34
  wifi: rtw89: 8852b: update TX power tables to R35
  wifi: rtw89: 8852c: update TX power tables to R67
  wifi: rtw89: regd: configure Thailand in regulation type
  wifi: mac80211: add back SPDX identifier
  wifi: mac80211: fix ieee80211_drop_unencrypted_mgmt return type/value
  wifi: rtlwifi: cleanup few rtlxxxx_set_hw_reg() routines
  ...

====================

Link: https://lore.kernel.org/r/87jzrz6bvw.fsf@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-06 16:07:29 -07:00
Moshe Shemesh
aba0e909dc devlink: Hold devlink lock on health reporter dump get
Devlink health dump get callback should take devlink lock as any other
devlink callback. Otherwise, since devlink_mutex was removed, this
callback is not protected from a race of the reporter being destroyed
while handling the callback.

Add devlink lock to the callback and to any call for
devlink_health_do_dump(). This should be safe as non of the drivers dump
callback implementation takes devlink lock.

As devlink lock is added to any callback of dump, the reporter dump_lock
is now redundant and can be removed.

Fixes: d3efc2a6a6 ("net: devlink: remove devlink_mutex")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/1696510216-189379-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-06 15:56:46 -07:00
Jakub Kicinski
e794b089cd Merge tag 'linux-can-next-for-6.7-20231005' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:

====================
pull-request: can-next 2023-10-05

The first patch is by Miquel Raynal and fixes a comment in the sja1000
driver.

Vincent Mailhol contributes 2 patches that fix W=1 compiler warnings
in the etas_es58x driver.

Jiapeng Chong's patch removes an unneeded NULL pointer check before
dev_put() in the CAN raw protocol.

A patch by Justin Stittreplaces a strncpy() by strscpy() in the
peak_pci sja1000 driver.

The next 5 patches are by me and fix the can_restart() handler and
replace BUG_ON()s in the CAN dev helpers with proper error handling.

The last 27 patches are also by me and target the at91_can driver.
First a new helper function is introduced, the at91_can driver is
cleaned up and updated to use the rx-offload helper.

* tag 'linux-can-next-for-6.7-20231005' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (37 commits)
  can: at91_can: switch to rx-offload implementation
  can: at91_can: at91_alloc_can_err_skb() introduce new function
  can: at91_can: at91_irq_err_line(): send error counters with state change
  can: at91_can: at91_irq_err_line(): make use of can_change_state() and can_bus_off()
  can: at91_can: at91_irq_err_line(): take reg_sr into account for bus off
  can: at91_can: at91_irq_err_line(): make use of can_state_get_by_berr_counter()
  can: at91_can: at91_irq_err(): rename to at91_irq_err_line()
  can: at91_can: at91_irq_err_frame(): move next to at91_irq_err()
  can: at91_can: at91_irq_err_frame(): call directly from IRQ handler
  can: at91_can: at91_poll_err(): increase stats even if no quota left or OOM
  can: at91_can: at91_poll_err(): fold in at91_poll_err_frame()
  can: at91_can: add CAN transceiver support
  can: at91_can: at91_open(): forward request_irq()'s return value in case or an error
  can: at91_can: at91_chip_start(): don't disable IRQs twice
  can: at91_can: at91_set_bittiming(): demote register output to debug level
  can: at91_can: rename struct at91_priv::{tx_next,tx_echo} to {tx_head,tx_tail}
  can: at91_can: at91_setup_mailboxes(): update comments
  can: at91_can: add more register definitions
  can: at91_can: MCR Register: convert to FIELD_PREP()
  can: at91_can: MSR Register: convert to FIELD_PREP()
  ...
====================

Link: https://lore.kernel.org/r/20231005195812.549776-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-06 15:42:12 -07:00
Johannes Berg
7d6904bf26 Merge wireless into wireless-next
Resolve several conflicts, mostly between changes/fixes in
wireless and the locking rework in wireless-next. One of
the conflicts actually shows a bug in wireless that we'll
want to fix separately.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
2023-10-06 17:08:47 +03:00
Lukas Magel
d9c2ba65e6 can: isotp: isotp_sendmsg(): fix TX state detection and wait behavior
With patch [1], isotp_poll was updated to also queue the poller in the
so->wait queue, which is used for send state changes. Since the queue
now also contains polling tasks that are not interested in sending, the
queue fill state can no longer be used as an indication of send
readiness. As a consequence, nonblocking writes can lead to a race and
lock-up of the socket if there is a second task polling the socket in
parallel.

With this patch, isotp_sendmsg does not consult wq_has_sleepers but
instead tries to atomically set so->tx.state and waits on so->wait if it
is unable to do so. This behavior is in alignment with isotp_poll, which
also checks so->tx.state to determine send readiness.

V2:
- Revert direct exit to goto err_event_drop

[1] https://lore.kernel.org/all/20230331125511.372783-1-michal.sojka@cvut.cz

Reported-by: Maxime Jayat <maxime.jayat@mobile-devices.fr>
Closes: https://lore.kernel.org/linux-can/11328958-453f-447f-9af8-3b5824dfb041@munic.io/
Signed-off-by: Lukas Magel <lukas.magel@posteo.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Fixes: 79e19fa79c ("can: isotp: isotp_ops: fix poll() to not report false EPOLLOUT events")
Link: https://github.com/pylessard/python-udsoncan/issues/178#issuecomment-1743786590
Link: https://lore.kernel.org/all/20230827092205.7908-1-lukas.magel@posteo.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-10-06 12:54:33 +02:00
Gustavo A. R. Silva
c4d49196ce net: sched: cls_u32: Fix allocation size in u32_init()
commit d61491a51f ("net/sched: cls_u32: Replace one-element array
with flexible-array member") incorrecly replaced an instance of
`sizeof(*tp_c)` with `struct_size(tp_c, hlist->ht, 1)`. This results
in a an over-allocation of 8 bytes.

This change is wrong because `hlist` in `struct tc_u_common` is a
pointer:

net/sched/cls_u32.c:
struct tc_u_common {
        struct tc_u_hnode __rcu *hlist;
        void                    *ptr;
        int                     refcnt;
        struct idr              handle_idr;
        struct hlist_node       hnode;
        long                    knodes;
};

So, the use of `struct_size()` makes no sense: we don't need to allocate
any extra space for a flexible-array member. `sizeof(*tp_c)` is just fine.

So, `struct_size(tp_c, hlist->ht, 1)` translates to:

sizeof(*tp_c) + sizeof(tp_c->hlist->ht) ==
sizeof(struct tc_u_common) + sizeof(struct tc_u_knode *) ==
						144 + 8  == 0x98 (byes)
						     ^^^
						      |
						unnecessary extra
						allocation size

$ pahole -C tc_u_common net/sched/cls_u32.o
struct tc_u_common {
	struct tc_u_hnode *        hlist;                /*     0     8 */
	void *                     ptr;                  /*     8     8 */
	int                        refcnt;               /*    16     4 */

	/* XXX 4 bytes hole, try to pack */

	struct idr                 handle_idr;           /*    24    96 */
	/* --- cacheline 1 boundary (64 bytes) was 56 bytes ago --- */
	struct hlist_node          hnode;                /*   120    16 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	long int                   knodes;               /*   136     8 */

	/* size: 144, cachelines: 3, members: 6 */
	/* sum members: 140, holes: 1, sum holes: 4 */
	/* last cacheline: 16 bytes */
};

And with `sizeof(*tp_c)`, we have:

	sizeof(*tp_c) == sizeof(struct tc_u_common) == 144 == 0x90 (bytes)

which is the correct and original allocation size.

Fix this issue by replacing `struct_size(tp_c, hlist->ht, 1)` with
`sizeof(*tp_c)`, and avoid allocating 8 too many bytes.

The following difference in binary output is expected and reflects the
desired change:

| net/sched/cls_u32.o
| @@ -6148,7 +6148,7 @@
| include/linux/slab.h:599
|     2cf5:      mov    0x0(%rip),%rdi        # 2cfc <u32_init+0xfc>
|                        2cf8: R_X86_64_PC32     kmalloc_caches+0xc
|-    2cfc:      mov    $0x98,%edx
|+    2cfc:      mov    $0x90,%edx

Reported-by: Alejandro Colomar <alx@kernel.org>
Closes: https://lore.kernel.org/lkml/09b4a2ce-da74-3a19-6961-67883f634d98@kernel.org/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-06 11:43:05 +01:00
Kees Cook
b3783e5efd net/packet: Annotate struct packet_fanout with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct packet_fanout.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Anqi Shen <amy.saq@antgroup.com>
Cc: netdev@vger.kernel.org
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-06 11:36:25 +01:00
Kees Cook
eaede99c3a netlink: Annotate struct netlink_policy_dump_state with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct netlink_policy_dump_state.

Additionally update the size of the usage array length before accessing
it. This requires remembering the old size for the memset() and later
assignments.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: netdev@vger.kernel.org
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-06 10:48:46 +01:00
Florian Westphal
7a0207094f xfrm: policy: replace session decode with flow dissector
xfrm needs to populate ipv4/v6 flow struct for route lookup.
In the past there were several bugs in this code:

1. callers that forget to reload header pointers after
   xfrm_decode_session() (it may pull headers).
2. bugs in decoding where accesses past skb->data occurred.

Meanwhile network core gained a packet dissector as well.
This switches xfrm to the flow dissector.

Changes since RFC:
Drop ipv6 mobiliy header support, AFAIU noone uses this.

Drop extraction of flowlabel, replaced code doesn't set it either.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/netdev/20230908120628.26164-3-fw@strlen.de/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-06 08:35:12 +02:00
Florian Westphal
45f87dd6b3 xfrm: move mark and oif flowi decode into common code
flowi4_oif/flowi6_oif and mark are aliased to flowi_common field, i.e.
all can be used interchangeably.

Instead of duplicating place this in common code.
No functional changes intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-06 08:33:47 +02:00
Florian Westphal
2b1dc6285c xfrm: pass struct net to xfrm_decode_session wrappers
Preparation patch, extra arg is not used.
No functional changes intended.

This is needed to replace the xfrm session decode functions with
the flow dissector.

skb_flow_dissect() cannot be used as-is, because it attempts to deduce the
'struct net' to use for bpf program fetch from skb->sk or skb->dev, but
xfrm code path can see skbs that have neither sk or dev filled in.

So either flow dissector needs to try harder, e.g. by also trying
skb->dst->dev, or we have to pass the struct net explicitly.

Passing the struct net doesn't look too bad to me, most places
already have it available or can derive it from the output device.

Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/netdev/202309271628.27fd2187-oliver.sang@intel.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-06 08:31:53 +02:00
Steffen Klassert
221ddb723d xfrm: Support GRO for IPv6 ESP in UDP encapsulation
This patch enables the GRO codepath for IPv6 ESP in UDP encapsulated
packets. Decapsulation happens at L2 and saves a full round through
the stack for each packet. This is also needed to support HW offload
for ESP in UDP encapsulation.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2023-10-06 07:31:14 +02:00
Steffen Klassert
172bf009c1 xfrm: Support GRO for IPv4 ESP in UDP encapsulation
This patch enables the GRO codepath for IPv4 ESP in UDP encapsulated
packets. Decapsulation happens at L2 and saves a full round through
the stack for each packet. This is also needed to support HW offload
for ESP in UDP encapsulation.

Enabling this would imporove performance for ESP in UDP datapath, i.e
IPsec with NAT in between.

By default GRP for ESP-in-UDP is disabled for UDP sockets.
To enable this feature for an ESP socket, the following two options
need to be set:
1. enable ESP-in-UDP: (this is already set by an IKE daemon).
   int type = UDP_ENCAP_ESPINUDP;
   setsockopt(fd, SOL_UDP, UDP_ENCAP, &type, sizeof(type));

2. To enable GRO for ESP in UDP socket:
   type = true;
   setsockopt(fd, SOL_UDP, UDP_GRO, &type, sizeof(type));

Enabling ESP-in-UDP has the side effect of preventing the Linux stack from
seeing ESP packets at the L3 (when ESP OFFLOAD is disabled), as packets are
immediately decapsulated from UDP and decrypted.
This change may affect nftable rules that match on ESP packets at L3.
Also tcpdump won't see the ESP packet.

Developers/admins are advised to review and adapt any nftable rules
accordingly before enabling this feature to prevent potential rule breakage.
Also tcpdump will not see from ESP packets from a ESP in UDP flow, when this
is enabled.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2023-10-06 07:30:40 +02:00
Steffen Klassert
b439475a0d xfrm: Use the XFRM_GRO to indicate a GRO call on input
This is needed to support GRO for ESP in UDP encapsulation.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2023-10-06 07:29:52 +02:00
Kees Cook
0fef0907d6 netem: Annotate struct disttable with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct disttable.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20231003231823.work.684-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05 18:31:49 -07:00
Jakub Kicinski
2606cf059c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts (or adjacent changes of note).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05 13:16:47 -07:00
Linus Torvalds
f291209eca Merge tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from Bluetooth, netfilter, BPF and WiFi.

  I didn't collect precise data but feels like we've got a lot of 6.5
  fixes here. WiFi fixes are most user-awaited.

  Current release - regressions:

   - Bluetooth: fix hci_link_tx_to RCU lock usage

  Current release - new code bugs:

   - bpf: mprog: fix maximum program check on mprog attachment

   - eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()

  Previous releases - regressions:

   - ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling

   - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
     doesn't handle zero length like we expected

   - wifi:
      - cfg80211: fix cqm_config access race, fix crashes with brcmfmac
      - iwlwifi: mvm: handle PS changes in vif_cfg_changed
      - mac80211: fix mesh id corruption on 32 bit systems
      - mt76: mt76x02: fix MT76x0 external LNA gain handling

   - Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER

   - l2tp: fix handling of transhdrlen in __ip{,6}_append_data()

   - dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent

   - eth: stmmac: fix the incorrect parameter after refactoring

  Previous releases - always broken:

   - net: replace calls to sock->ops->connect() with kernel_connect(),
     prevent address rewrite in kernel_bind(); otherwise BPF hooks may
     modify arguments, unexpectedly to the caller

   - tcp: fix delayed ACKs when reads and writes align with MSS

   - bpf:
      - verifier: unconditionally reset backtrack_state masks on global
        func exit
      - s390: let arch_prepare_bpf_trampoline return program size, fix
        struct_ops offsets
      - sockmap: fix accounting of available bytes in presence of PEEKs
      - sockmap: reject sk_msg egress redirects to non-TCP sockets

   - ipv4/fib: send netlink notify when delete source address routes

   - ethtool: plca: fix width of reads when parsing netlink commands

   - netfilter: nft_payload: rebuild vlan header on h_proto access

   - Bluetooth: hci_codec: fix leaking memory of local_codecs

   - eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids

   - eth: stmmac:
     - dwmac-stm32: fix resume on STM32 MCU
     - remove buggy and unneeded stmmac_poll_controller, depend on NAPI

   - ibmveth: always recompute TCP pseudo-header checksum, fix use of
     the driver with Open vSwitch

   - wifi:
      - rtw88: rtw8723d: fix MAC address offset in EEPROM
      - mt76: fix lock dependency problem for wed_lock
      - mwifiex: sanity check data reported by the device
      - iwlwifi: ensure ack flag is properly cleared
      - iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
      - iwlwifi: mvm: fix incorrect usage of scan API

  Misc:

   - wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"

* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
  MAINTAINERS: update Matthieu's email address
  mptcp: userspace pm allow creating id 0 subflow
  mptcp: fix delegated action races
  net: stmmac: remove unneeded stmmac_poll_controller
  net: lan743x: also select PHYLIB
  net: ethernet: mediatek: disable irq before schedule napi
  net: mana: Fix oversized sge0 for GSO packets
  net: mana: Fix the tso_bytes calculation
  net: mana: Fix TX CQE error handling
  netlink: annotate data-races around sk->sk_err
  sctp: update hb timer immediately after users change hb_interval
  sctp: update transport state when processing a dupcook packet
  tcp: fix delayed ACKs for MSS boundary condition
  tcp: fix quick-ack counting to count actual ACKs of new data
  page_pool: fix documentation typos
  tipc: fix a potential deadlock on &tx->lock
  net: stmmac: dwmac-stm32: fix resume on STM32 MCU
  ipv4: Set offload_failed flag in fibmatch results
  netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
  netfilter: nf_tables: Deduplicate nft_register_obj audit logs
  ...
2023-10-05 11:29:21 -07:00