At the moment, bonding xfrm crypto offload can only be set up if the bonding
module is loaded with active-backup mode already set. We need to be able to
make this work with bonds set to AB after the bonding driver has already
been loaded.
So what's done here is:
1) move #define BOND_XFRM_FEATURES to net/bonding.h so it can be used
by both bond_main.c and bond_options.c
2) set BOND_XFRM_FEATURES in bond_dev->hw_features universally, rather than
only when loading in AB mode
3) wire up xfrmdev_ops universally too
4) disable BOND_XFRM_FEATURES in bond_dev->features if not AB
5) exit early (non-AB case) from bond_ipsec_offload_ok, to prevent a
performance hit from traversing into the underlying drivers
5) toggle BOND_XFRM_FEATURES in bond_dev->wanted_features and call
netdev_change_features() from bond_option_mode_set()
In my local testing, I can change bonding modes back and forth on the fly,
have hardware offload work when I'm in AB, and see no performance penalty
to non-AB software encryption, despite having xfrm bits all wired up for
all modes now.
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Reported-by: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Left shifting the u16 value promotes it to a int and then it
gets sign extended to a u64. If len << 16 is greater than 0x7fffffff
then the upper bits get set to 1 because of the implicit sign extension.
Fix this by casting len to u64 before shifting it.
Addresses-Coverity: ("integer handling issues")
Fixes: ed9b7646b0 ("net/tls: Add asynchronous resync")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 'tcfp_burst' with TICK factor, driver side always need to recover
it to the original value, this patch moves the generic calculation and
recover to the 'burst' original value before offloading to device driver.
Signed-off-by: Po Liu <po.liu@nxp.com>
Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5-tls-2020-06-26
1) Improve hardware layouts and structure for kTLS support
2) Generalize ICOSQ (Internal Channel Operations Send Queue)
Due to the asynchronous nature of adding new kTLS flows and handling
HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to
support generic async operations, such as kTLS add flow and resync, in
addition to the existing XSK usages.
3) kTLS hardware flow steering and classification:
The driver already has the means to classify TCP ipv4/6 flows to send them
to the corresponding RSS HW engine, as reflected in patches 3 through 5,
the series will add a steering layer that will hook to the driver's TCP
classifiers and will match on well known kTLS connection, in case of a
match traffic will be redirected to the kTLS decryption engine, otherwise
traffic will continue flowing normally to the TCP RSS engine.
3) kTLS add flow RX HW offload support
New offload contexts post their static/progress params WQEs
(Work Queue Element) to communicate the newly added kTLS contexts
over the per-channel async ICOSQ.
The Channel/RQ is selected according to the socket's rxq index.
A new TLS-RX workqueue is used to allow asynchronous addition of
steering rules, out of the NAPI context.
It will be also used in a downstream patch in the resync procedure.
Feature is OFF by default. Can be turned on by:
$ ethtool -K <if> tls-hw-rx-offload on
4) Added mlx5 kTLS sw stats and new counters are documented in
Documentation/networking/tls-offload.rst
rx_tls_ctx - number of TLS RX HW offload contexts added to device for
decryption.
rx_tls_ooo - number of RX packets which were part of a TLS stream
but did not arrive in the expected order and triggered the resync
procedure.
rx_tls_del - number of TLS RX HW offload contexts deleted from device
(connection has finished).
rx_tls_err - number of RX packets which were part of a TLS stream
but were not decrypted due to unexpected error in the state machine.
5) Asynchronous RX resync
a. The NIC driver indicates that it would like to resync on some TLS
record within the received packet (P), but the driver does not
know (yet) which of the TLS records within the packet.
At this stage, the NIC driver will query the device to find the exact
TCP sequence for resync (tcpsn), however, the driver does not wait
for the device to provide the response.
b. Eventually, the device responds, and the driver provides the tcpsn
within the resync packet to KTLS. Now, KTLS can check the tcpsn against
any processed TLS records within packet P, and also against any record
that is processed in the future within packet P.
The asynchronous resync path simplifies the device driver, as it can
save bits on the packet completion (32-bit TCP sequence), and pass this
information on an asynchronous command instead.
Performance:
CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off
NIC: ConnectX-6 Dx 100GbE dual port
Goodput (app-layer throughput) comparison:
+---------------+-------+-------+---------+
| # connections | 1 | 4 | 8 |
+---------------+-------+-------+---------+
| SW (Gbps) | 7.26 | 24.70 | 50.30 |
+---------------+-------+-------+---------+
| HW (Gbps) | 18.50 | 64.30 | 92.90 |
+---------------+-------+-------+---------+
| Speedup | 2.55x | 2.56x | 1.85x * |
+---------------+-------+-------+---------+
* After linerate is reached, diff is observed in CPU util
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow acting on dropped and/or ECN-marked packets, add two new
qevents to the RED qdisc: "early_drop" and "mark". Filters attached at
"early_drop" block are executed as packets are early-dropped, those
attached at the "mark" block are executed as packets are ECN-marked.
Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block
index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark"
qevent. Absence of these attributes signifies "don't care": no block is
allocated in that case, or the existing blocks are left intact in case of
the change callback.
For purposes of offloading, blocks attached to these qevents appear with
newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and
FLOW_BLOCK_BINDER_TYPE_RED_MARK.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qevents are attach points for TC blocks, where filters can be put that are
executed when "interesting events" take place in a qdisc. The data to keep
and the functions to invoke to maintain a qevent will be largely the same
between qevents. Therefore introduce sched-wide helpers for qevent
management.
Currently, similarly to ingress and egress blocks of clsact pseudo-qdisc,
blocks attachment cannot be changed after the qdisc is created. To that
end, add a helper tcf_qevent_validate_change(), which verifies whether
block index attribute is not attached, or if it is, whether its value
matches the current one (i.e. there is no material change).
The function tcf_qevent_handle() should be invoked when qdisc hits the
"interesting event" corresponding to a block. This function releases root
lock for the duration of executing the attached filters, to allow packets
generated through user actions (notably mirred) to be reinserted to the
same qdisc tree.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A following patch introduces qevents, points in qdisc algorithm where
packet can be processed by user-defined filters. Should this processing
lead to a situation where a new packet is to be enqueued on the same port,
holding the root lock would lead to deadlocks. To solve the issue, qevent
handler needs to unlock and relock the root lock when necessary.
To that end, add the root lock argument to the qdisc op enqueue, and
propagate throughout.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Metadata need not be collected in receive if the packet from bareudp
device is not targeted to openvswitch.
Signed-off-by: Martin <martin.varghese@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for asynchronous resynchronization in tls_device.
Async resync follows two distinct stages:
1. The NIC driver indicates that it would like to resync on some TLS
record within the received packet (P), but the driver does not
know (yet) which of the TLS records within the packet.
At this stage, the NIC driver will query the device to find the exact
TCP sequence for resync (tcpsn), however, the driver does not wait
for the device to provide the response.
2. Eventually, the device responds, and the driver provides the tcpsn
within the resync packet to KTLS. Now, KTLS can check the tcpsn against
any processed TLS records within packet P, and also against any record
that is processed in the future within packet P.
The asynchronous resync path simplifies the device driver, as it can
save bits on the packet completion (32-bit TCP sequence), and pass this
information on an asynchronous command instead.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Minor overlapping changes in xfrm_device.c, between the double
ESP trailing bug fix setting the XFRM_INIT flag and the changes
in net-next preparing for bonding encryption support.
Signed-off-by: David S. Miller <davem@davemloft.net>
If a socket is set ipv6only, it will still send IPv4 addresses in the
INIT and INIT_ACK packets. This potentially misleads the peer into using
them, which then would cause association termination.
The fix is to not add IPv4 addresses to ipv6only sockets.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware device may include more than one police entry. Specifying the
action's index make it possible for several tc filters to share the same
police action when installing the filters.
Propagate this index to device drivers through the flow offload
intermediate representation, so that drivers could share a single
hardware policer between multiple filters.
v1->v2 changes:
- Update the commit message suggest by Ido Schimmel <idosch@idosch.org>
Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current police offloading support the 'burst'' and 'rate_bytes_ps'. Some
hardware own the capability to limit the frame size. If the frame size
larger than the setting, the frame would be dropped. For the police
action itself already accept the 'mtu' parameter in tc command. But not
extend to tc flower offloading. So extend 'mtu' to tc flower offloading.
Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clearing the sock TX queue in sk_set_socket() might cause unexpected
out-of-order transmit when called from sock_orphan(), as outstanding
packets can pick a different TX queue and bypass the ones already queued.
This is undesired in general. More specifically, it breaks the in-order
scheduling property guarantee for device-offloaded TLS sockets.
Remove the call to sk_tx_queue_clear() in sk_set_socket(), and add it
explicitly only where needed.
Fixes: e022f0b4a0 ("net: Introduce sk_tx_queue_mapping")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes. Also, remove unnecessary
function ipv6_rpl_srh_alloc_size() and replace kzalloc() with kcalloc(),
which has a 2-factor argument form for multiplication.
This code was detected with the help of Coccinelle and, audited and
fixed manually.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes following warnings :
CC net/ipv4/udp_offload.o
net/ipv4/udp_offload.c:504:17: warning: no previous prototype for 'udp4_gro_receive' [-Wmissing-prototypes]
504 | struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~
net/ipv4/udp_offload.c:584:29: warning: no previous prototype for 'udp4_gro_complete' [-Wmissing-prototypes]
584 | INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
| ^~~~~~~~~~~~~~~~~
CHECK net/ipv6/udp_offload.c
net/ipv6/udp_offload.c:115:16: warning: symbol 'udp6_gro_receive' was not declared. Should it be static?
net/ipv6/udp_offload.c:148:29: warning: symbol 'udp6_gro_complete' was not declared. Should it be static?
CC net/ipv6/udp_offload.o
net/ipv6/udp_offload.c:115:17: warning: no previous prototype for 'udp6_gro_receive' [-Wmissing-prototypes]
115 | struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~
net/ipv6/udp_offload.c:148:29: warning: no previous prototype for 'udp6_gro_complete' [-Wmissing-prototypes]
148 | INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
| ^~~~~~~~~~~~~~~~~
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes following (C=1 W=1) warnings for CONFIG_RETPOLINE=y :
net/ipv4/tcp_offload.c:306:16: warning: symbol 'tcp4_gro_receive' was not declared. Should it be static?
net/ipv4/tcp_offload.c:306:17: warning: no previous prototype for 'tcp4_gro_receive' [-Wmissing-prototypes]
net/ipv4/tcp_offload.c:319:29: warning: symbol 'tcp4_gro_complete' was not declared. Should it be static?
net/ipv4/tcp_offload.c:319:29: warning: no previous prototype for 'tcp4_gro_complete' [-Wmissing-prototypes]
CHECK net/ipv6/tcpv6_offload.c
net/ipv6/tcpv6_offload.c:16:16: warning: symbol 'tcp6_gro_receive' was not declared. Should it be static?
net/ipv6/tcpv6_offload.c:29:29: warning: symbol 'tcp6_gro_complete' was not declared. Should it be static?
CC net/ipv6/tcpv6_offload.o
net/ipv6/tcpv6_offload.c:16:17: warning: no previous prototype for 'tcp6_gro_receive' [-Wmissing-prototypes]
16 | struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~
net/ipv6/tcpv6_offload.c:29:29: warning: no previous prototype for 'tcp6_gro_complete' [-Wmissing-prototypes]
29 | INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
| ^~~~~~~~~~~~~~~~~
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Declare ipv4_specific once, in tcp.h were it belongs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_specific should be declared in tcp include files,
not mptcp.
This removes the following warning :
CHECK net/ipv6/tcp_ipv6.c
net/ipv6/tcp_ipv6.c:78:42: warning: symbol 'ipv6_specific' was not declared. Should it be static?
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove these errors:
net/ipv6/tcp_ipv6.c:1550:29: warning: symbol 'tcp_v6_rcv' was not declared. Should it be static?
net/ipv6/tcp_ipv6.c:1770:30: warning: symbol 'tcp_v6_early_demux' was not declared. Should it be static?
net/ipv6/tcp_ipv6.c:1550:29: warning: no previous prototype for 'tcp_v6_rcv' [-Wmissing-prototypes]
1550 | INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
| ^~~~~~~~~~
net/ipv6/tcp_ipv6.c:1770:30: warning: no previous prototype for 'tcp_v6_early_demux' [-Wmissing-prototypes]
1770 | INDIRECT_CALLABLE_SCOPE void tcp_v6_early_demux(struct sk_buff *skb)
| ^~~~~~~~~~~~~~~~~~
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was reported that a considerable amount of cycles were spent on the
expensive indirect calls on fib6_rule_lookup. This patch introduces an
inline helper called pol_route_func that uses the indirect_call_wrappers
to avoid the indirect calls.
This patch saves around 50ns per call.
Performance was measured on the receiver by checking the amount of
syncookies that server was able to generate under a synflood load.
Traffic was generated using trafgen[1] which was pushing around 1Mpps on
a single queue. Receiver was using only one rx queue which help to
create a bottle neck and make the experiment rx-bounded.
These are the syncookies generated over 10s from the different runs:
Whithout the patch:
TcpExtSyncookiesSent 3553749 0.0
TcpExtSyncookiesSent 3550895 0.0
TcpExtSyncookiesSent 3553845 0.0
TcpExtSyncookiesSent 3541050 0.0
TcpExtSyncookiesSent 3539921 0.0
TcpExtSyncookiesSent 3557659 0.0
TcpExtSyncookiesSent 3526812 0.0
TcpExtSyncookiesSent 3536121 0.0
TcpExtSyncookiesSent 3529963 0.0
TcpExtSyncookiesSent 3536319 0.0
With the patch:
TcpExtSyncookiesSent 3611786 0.0
TcpExtSyncookiesSent 3596682 0.0
TcpExtSyncookiesSent 3606878 0.0
TcpExtSyncookiesSent 3599564 0.0
TcpExtSyncookiesSent 3601304 0.0
TcpExtSyncookiesSent 3609249 0.0
TcpExtSyncookiesSent 3617437 0.0
TcpExtSyncookiesSent 3608765 0.0
TcpExtSyncookiesSent 3620205 0.0
TcpExtSyncookiesSent 3601895 0.0
Without the patch the average is 354263 pkt/s or 2822 ns/pkt and with
the patch the average is 360738 pkt/s or 2772 ns/pkt which gives an
estimate of 50 ns per packet.
[1] http://netsniff-ng.org/
Changelog since v1:
- Change ordering in the ICW (Paolo Abeni)
Cc: Luigi Rizzo <lrizzo@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, this support is limited to active-backup mode, as I'm not sure
about the feasilibity of mapping an xfrm_state's offload handle to
multiple hardware devices simultaneously, and we rely on being able to
pass some hints to both the xfrm and NIC driver about whether or not
they're operating on a slave device.
I've tested this atop an Intel x520 device (ixgbe) using libreswan in
transport mode, succesfully achieving ~4.3Gbps throughput with netperf
(more or less identical to throughput on a bare NIC in this system),
as well as successful failover and recovery mid-netperf.
v2: just use CONFIG_XFRM_OFFLOAD for wrapping, isolate more code with it
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PCI PF and VF devlink port can manage the function represented by a
devlink port.
Allow users to set port function's hardware address.
Example of a PCI VF port which supports a port function:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00
$ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:11:22:33:44:55
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PCI PF and VF devlink port can manage the function represented by
a devlink port.
Enable users to query port function's hardware address.
Example of a PCI VF port which supports a port function:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:11:22:33:44:66
$ devlink port show pci/0000:06:00.0/2 -jp
{
"port": {
"pci/0000:06:00.0/2": {
"type": "eth",
"netdev": "enp6s0pf0vf1",
"flavour": "pcivf",
"pfnum": 0,
"vfnum": 1,
"function": {
"hw_addr": "00:11:22:33:44:66"
}
}
}
}
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mitigate RETPOLINE costs in __tcp_transmit_skb()
by using INDIRECT_CALL_INET() wrapper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mitigate RETPOLINE costs in __tcp_transmit_skb()
by using INDIRECT_CALL_INET() wrapper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be useful to allow busy poll for tunneled traffic. In case of
busy poll for sessions over tunnels, the underlying physical device's
queues need to be polled.
Tunnels schedule NAPI either via netif_rx() for backlog queue or
schedule the gro_cell_poll(). netif_rx() propagates the valid skb->napi_id
to the socket. OTOH, gro_cell_poll() stamps the skb->napi_id again by
calling skb_mark_napi_id() with the tunnel NAPI which is not a busy poll
candidate. This was preventing tunneled traffic to use busy poll. A valid
NAPI ID in the skb indicates it was already marked for busy poll by a
NAPI driver and hence needs to be copied into the socket.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add infrastructure to l3mdev (the core code for Layer 3 master devices) in
order to find out the corresponding VRF device for a given table id.
Therefore, the l3mdev implementations:
- can register a callback that returns the device index of the l3mdev
associated with a given table id;
- can offer the lookup function (table to VRF device).
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the representor is removed, then identify the indirect flow_blocks
that need to be removed by the release callback and the port representor
structure. To identify the port representor structure, a new
indr.cb_priv field needs to be introduced. The flow_block also needs to
be removed from the driver list from the cleanup path.
Fixes: 1fac52da59 ("net: flow_offload: consolidate indirect flow_block infrastructure")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare fix the bug in the next patch. use flow_indr_block_cb_alloc/remove
function and remove the __flow_block_indr_binding.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add flow_indr_block_cb_alloc/remove function for next fix patch.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2020-06-19
1) Fix double ESP trailer insertion in IPsec crypto offload if
netif_xmit_frozen_or_stopped is true. From Huy Nguyen.
2) Merge fixup for "remove output_finish indirection from
xfrm_state_afinfo". From Stephen Rothwell.
3) Select CRYPTO_SEQIV for ESP as this is needed for GCM and several
other encryption algorithms. Also modernize the crypto algorithm
selections for ESP and AH, remove those that are maked as "MUST NOT"
and add those that are marked as "MUST" be implemented in RFC 8221.
From Eric Biggers.
Please note the merge conflict between commit:
a7f7f6248d ("treewide: replace '---help---' in Kconfig files with 'help'")
from Linus' tree and commits:
7d4e391959 ("esp, ah: consolidate the crypto algorithm selections")
be01369859 ("esp, ah: modernize the crypto algorithm selections")
from the ipsec tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a drop frames counter to tc flower offloading.
Reporting h/w dropped frames is necessary for some actions.
Some actions like police action and the coming introduced stream gate
action would produce dropped frames which is necessary for user. Status
update shows how many filtered packets increasing and how many dropped
in those packets.
v2: Changes
- Update commit comments suggest by Jiri Pirko.
Signed-off-by: Po Liu <Po.Liu@nxp.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add tlen field into struct tso_t, and change tso_start()
to return skb_transport_offset(skb) + tso->tlen
This removes from callers the need to use tcp_hdrlen(skb) and
will ease UDP segmentation offload addition.
v2: calls tso_start() earlier in otx2_sq_append_tso() [Jakub]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb argument of tso_count_descs(), tso_build_hdr() and tso_build_data() can be const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
size field can be an int, no need for size_t
Removes a 32bit hole on 64bit kernels.
And align fields for better readability.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Transport header size could be 60 bytes, and network header
size can also be 60 bytes. Add the Ethernet header and we
are above 128 bytes.
Since drivers using net/core/tso.c usually allocates
one DMA coherent piece of memory per TX queue, this patch
might cause issues if a driver was using too many slots.
For 1024 slots, we would need 256 KB of physically
contiguous memory instead of 128 KB.
Alternative fix would be to add checks in the fast path,
but this involves more work in all drivers using net/core/tso.c.
Fixes: f9cbe9a556 ("net: define the TSO header size in net/tso.h")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, nf_flow_table_offload_add/del_cb are exported by nf_flow_table
module, therefore modules using them will have hard-dependency
on nf_flow_table and will require loading it all the time.
This can lead to an unnecessary overhead on systems that do not
use this API.
To relax the hard-dependency between the modules, we unexport these
functions and make them static inline.
Fixes: 978703f425 ("netfilter: flowtable: Add API for registering to flow table events")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, tcf_ct_flow_table_restore_skb is exported by act_ct
module, therefore modules using it will have hard-dependency
on act_ct and will require loading it all the time.
This can lead to an unnecessary overhead on systems that do not
use hardware connection tracking action (ct_metadata action) in
the first place.
To relax the hard-dependency between the modules, we unexport this
function and make it a static inline one.
Fixes: 30b0cf90c6 ("net/sched: act_ct: Support restoring conntrack info on skbs")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix cfg80211 deadlock, from Johannes Berg.
2) RXRPC fails to send norigications, from David Howells.
3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
Geliang Tang.
4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.
5) The ucc_geth driver needs __netdev_watchdog_up exported, from
Valentin Longchamp.
6) Fix hashtable memory leak in dccp, from Wang Hai.
7) Fix how nexthops are marked as FDB nexthops, from David Ahern.
8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.
9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.
10) Fix link speed reporting in iavf driver, from Brett Creeley.
11) When a channel is used for XSK and then reused again later for XSK,
we forget to clear out the relevant data structures in mlx5 which
causes all kinds of problems. Fix from Maxim Mikityanskiy.
12) Fix memory leak in genetlink, from Cong Wang.
13) Disallow sockmap attachments to UDP sockets, it simply won't work.
From Lorenz Bauer.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
net: ethernet: ti: ale: fix allmulti for nu type ale
net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
net: atm: Remove the error message according to the atomic context
bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
libbpf: Support pre-initializing .bss global variables
tools/bpftool: Fix skeleton codegen
bpf: Fix memlock accounting for sock_hash
bpf: sockmap: Don't attach programs to UDP sockets
bpf: tcp: Recv() should return 0 when the peer socket is closed
ibmvnic: Flush existing work items before device removal
genetlink: clean up family attributes allocations
net: ipa: header pad field only valid for AP->modem endpoint
net: ipa: program upper nibbles of sequencer type
net: ipa: fix modem LAN RX endpoint id
net: ipa: program metadata mask differently
ionic: add pcie_print_link_status
rxrpc: Fix race between incoming ACK parser and retransmitter
net/mlx5: E-Switch, Fix some error pointer dereferences
net/mlx5: Don't fail driver on failure to create debugfs
net/mlx5e: CT: Fix ipv6 nat header rewrite actions
...
Pull sysctl fixes from Al Viro:
"Fixups to regressions in sysctl series"
* 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sysctl: reject gigantic reads/write to sysctl files
cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
trace: fix an incorrect __user annotation on stack_trace_sysctl
random: fix an incorrect __user annotation on proc_do_entropy
net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl
vxlan driver should be using helpers to access nexthop struct
internals. Remove open check if whether nexthop is multipath in
favor of the existing nexthop_is_multipath helper. Add a new
helper, nexthop_has_v4, to cover the need to check has_v4 in
a group.
Fixes: 1274e1cc42 ("vxlan: ecmp support for mac fdb entries")
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdb nexthops are marked with a flag. For standalone nexthops, a flag was
added to the nh_info struct. For groups that flag was added to struct
nexthop when it should have been added to the group information. Fix
by removing the flag from the nexthop struct and adding a flag to nh_group
that mirrors nh_info and is really only a caching of the individual types.
Add a helper, nexthop_is_fdb, for use by the vxlan code and fixup the
internal code to use the flag from either nh_info or nh_group.
v2
- propagate fdb_nh in remove_nh_grp_entry
Fixes: 38428d6871 ("nexthop: support for fdb ecmp nexthops")
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove function declarations that are not available in the tree anymore.
Fixes: 709ffbe19b ("net: remove indirect block netdev event registration")
Reported-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>