Commit Graph

16186 Commits

Author SHA1 Message Date
Kuniyuki Iwashima
9804985bf2 udp: Introduce optional per-netns hash table.
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.

On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.

  uhash_entries		sockets		length		Gbps
	    64K		      1		     1		5.69
			    1Mi		    16		5.27
			    2Mi		    32		4.90
			    4Mi		    64		4.09
			    8Mi		   128		2.96
			   16Mi		   256		2.06
			   32Mi		   512		1.12

The per-netns hash table breaks the lengthy lists into shorter ones.  It is
useful on a multi-tenant system with thousands of netns.  With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.

The max size of the per-netns table is 64K as well.  This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns and we cannot make full use of the whole buckets larger than 64K.

  /* 0 < num < 64K  ->  X < hash < X + 64K */
  (num + net_hash_mix(net)) & mask;

Also, the min size is 128.  We use a bitmap to search for an available
port in udp_lib_get_port().  To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.

The sysctl usage is the same with TCP:

  $ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
  UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)

  # sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 65536  # can be changed by uhash_entries

  # sysctl net.ipv4.udp_child_hash_entries
  net.ipv4.udp_child_hash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = -65536  # share the global table

  # sysctl -w net.ipv4.udp_child_hash_entries=100
  net.ipv4.udp_child_hash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 128  # own a per-netns table with 2^n buckets

We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future.  Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.

[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 09:43:35 +00:00
Kuniyuki Iwashima
67fb43308f udp: Set NULL to sk->sk_prot->h.udp_table.
We will soon introduce an optional per-netns hash table
for UDP.

This means we cannot use the global sk->sk_prot->h.udp_table
to fetch a UDP hash table.

Instead, set NULL to sk->sk_prot->h.udp_table for UDP and get
a proper table from net->ipv4.udp_table.

Note that we still need sk->sk_prot->h.udp_table for UDP LITE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 09:43:35 +00:00
Vladimir Oltean
53d04b9811 net: dsa: remove phylink_validate() method
As of now, no DSA driver uses a custom link mode validation procedure
anymore. So remove this DSA operation and let phylink determine what is
supported based on config->mac_capabilities (if provided by the driver).
Leave a comment why we left the code that we did, and that there is more
work to do.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15 20:34:27 -08:00
Jakub Kicinski
b87584cb8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

1) Fix sparse warning in the new nft_inner expression, reported
   by Jakub Kicinski.

2) Incorrect vlan header check in nft_inner, from Peng Wu.

3) Two patches to pass reset boolean to expression dump operation,
   in preparation for allowing to reset stateful expressions in rules.
   This adds a new NFT_MSG_GETRULE_RESET command. From Phil Sutter.

4) Inconsistent indentation in nft_fib, from Jiapeng Chong.

5) Speed up siphash calculation in conntrack, from Florian Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: conntrack: use siphash_4u64
  netfilter: rpfilter/fib: clean up some inconsistent indenting
  netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET
  netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters
  netfilter: nft_inner: fix return value check in nft_inner_parse_l2l3()
  netfilter: nft_payload: use __be16 to store gre version
====================

Link: https://lore.kernel.org/r/20221115095922.139954-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15 20:33:20 -08:00
Phil Sutter
8daa8fde3f netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET
Analogous to NFT_MSG_GETOBJ_RESET, but for rules: Reset stateful
expressions like counters or quotas. The latter two are the only
consumers, adjust their 'dump' callbacks to respect the parameter
introduced earlier.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15 10:53:17 +01:00
Phil Sutter
7d34aa3e03 netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters
Add a 'reset' flag just like with nft_object_ops::dump. This will be
useful to reset "anonymous stateful objects", e.g. simple rule counters.

No functional change intended.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15 10:46:34 +01:00
Steen Hegelund
70ea86a0df net: flow_offload: add support for ARP frame matching
This adds a new flow_rule_match_arp function that allows drivers
to be able to dissect ARP frames.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:16 +00:00
Jakub Kicinski
79b0872b10 Merge branch 'mana-shared-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Long Li says:

====================
Introduce Microsoft Azure Network Adapter (MANA) RDMA driver [netdev prep]

The first 11 patches which modify the MANA Ethernet driver to support
RDMA driver.

* 'mana-shared-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  net: mana: Define data structures for protection domain and memory registration
  net: mana: Define data structures for allocating doorbell page from GDMA
  net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
  net: mana: Define max values for SGL entries
  net: mana: Move header files to a common location
  net: mana: Record port number in netdev
  net: mana: Export Work Queue functions for use by RDMA driver
  net: mana: Set the DMA device max segment size
  net: mana: Handle vport sharing between devices
  net: mana: Record the physical address for doorbell page region
  net: mana: Add support for auxiliary device
====================

Link: https://lore.kernel.org/all/1667502990-2559-1-git-send-email-longli@linuxonhyperv.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10 12:07:19 -08:00
Ajay Sharma
28c66cfa45 net: mana: Define data structures for protection domain and memory registration
The MANA hardware support protection domain and memory registration for use
in RDMA environment. Add those definitions and expose them for use by the
RDMA driver.

Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-12-git-send-email-longli@linuxonhyperv.com
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-11-10 07:57:27 +02:00
Long Li
f72ececfc1 net: mana: Define data structures for allocating doorbell page from GDMA
The RDMA device needs to allocate doorbell pages for each user context.
Define the GDMA data structures for use by the RDMA driver.

Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-11-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-11-10 07:57:27 +02:00
Ajay Sharma
de372f2a9c net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
When doing memory registration, the PF may respond with
GDMA_STATUS_MORE_ENTRIES to indicate a follow request is needed. This is
not an error and should be processed as expected.

Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-10-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-11-10 07:57:27 +02:00
Long Li
aa56549792 net: mana: Define max values for SGL entries
The number of maximum SGl entries should be computed from the maximum
WQE size for the intended queue type and the corresponding OOB data
size. This guarantees the hardware queue can successfully queue requests
up to the queue depth exposed to the upper layer.

Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-9-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-11-10 07:57:27 +02:00
Long Li
fd325cd648 net: mana: Move header files to a common location
In preparation to add MANA RDMA driver, move all the required header files
to a common location for use by both Ethernet and RDMA drivers.

Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-8-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-11-10 07:57:26 +02:00
Ido Schimmel
2640a82bbc devlink: Add packet traps for 802.1X operation
Add packet traps for 802.1X operation. The "eapol" control trap is used
to trap EAPOL packets and is required for the correct operation of the
control plane. The "locked_port" drop trap can be enabled to gain
visibility into packets that were dropped by the device due to the
locked bridge port check.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-09 19:06:14 -08:00
Hans J. Schultz
27fabd02ab bridge: switchdev: Allow device drivers to install locked FDB entries
When the bridge is offloaded to hardware, FDB entries are learned and
aged-out by the hardware. Some device drivers synchronize the hardware
and software FDBs by generating switchdev events towards the bridge.

When a port is locked, the hardware must not learn autonomously, as
otherwise any host will blindly gain authorization. Instead, the
hardware should generate events regarding hosts that are trying to gain
authorization and their MAC addresses should be notified by the device
driver as locked FDB entries towards the bridge driver.

Allow device drivers to notify the bridge driver about such entries by
extending the 'switchdev_notifier_fdb_info' structure with the 'locked'
bit. The bit can only be set by device drivers and not by the bridge
driver.

Prevent a locked entry from being installed if MAB is not enabled on the
bridge port.

If an entry already exists in the bridge driver, reject the locked entry
if the current entry does not have the "locked" flag set or if it points
to a different port. The same semantics are implemented in the software
data path.

Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-09 19:06:13 -08:00
David S. Miller
3ca6c3b43c Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
rxrpc changes

David Howells says:

====================
rxrpc: Increasing SACK size and moving away from softirq, part 1

AF_RXRPC has some issues that need addressing:

 (1) The SACK table has a maximum capacity of 255, but for modern networks
     that isn't sufficient.  This is hard to increase in the upstream code
     because of the way the application thread is coupled to the softirq
     and retransmission side through a ring buffer.  Adjustments to the rx
     protocol allows a capacity of up to 8192, and having a ring
     sufficiently large to accommodate that would use an excessive amount
     of memory as this is per-call.

 (2) Processing ACKs in softirq mode causes the ACKs get conflated, with
     only the most recent being considered.  Whilst this has the upside
     that the retransmission algorithm only needs to deal with the most
     recent ACK, it causes DATA transmission for a call to be very bursty
     because DATA packets cannot be transmitted in softirq mode.  Rather
     transmission must be delegated to either the application thread or a
     workqueue, so there tend to be sudden bursts of traffic for any
     particular call due to scheduling delays.

 (3) All crypto in a single call is done in series; however, each DATA
     packet is individually encrypted so encryption and decryption of large
     calls could be parallelised if spare CPU resources are available.

This is the first of a number of sets of patches that try and address them.
The overall aims of these changes include:

 (1) To get rid of the TxRx ring and instead pass the packets round in
     queues (eg. sk_buff_head).  On the Tx side, each ACK packet comes with
     a SACK table that can be parsed as-is, so there's no particular need
     to maintain our own; we just have to refer to the ACK.

     On the Rx side, we do need to maintain a SACK table with one bit per
     entry - but only if packets go missing - and we don't want to have to
     perform a complex transformation to get the information into an ACK
     packet.

 (2) To try and move almost all processing of received packets out of the
     softirq handler and into a high-priority kernel I/O thread.  Only the
     transferral of packets would be left there.  I would still use the
     encap_rcv hook to receive packets as there's a noticeable performance
     drop from letting the UDP socket put the packets into its own queue
     and then getting them out of there.

 (3) To make the I/O thread also do all the transmission.  The app thread
     would be responsible for packaging the data into packets and then
     buffering them for the I/O thread to transmit.  This would make it
     easier for the app thread to run ahead of the I/O thread, and would
     mean the I/O thread is less likely to have to wait around for a new
     packet to come available for transmission.

 (4) To logically partition the socket/UAPI/KAPI side of things from the
     I/O side of things.  The local endpoint, connection, peer and call
     objects would belong to the I/O side.  The socket side would not then
     touch the private internals of calls and suchlike and would not change
     their states.  It would only look at the send queue, receive queue and
     a way to pass a message to cause an abort.

 (5) To remove as much locking, synchronisation, barriering and atomic ops
     as possible from the I/O side.  Exclusion would be achieved by
     limiting modification of state to the I/O thread only.  Locks would
     still need to be used in communication with the UDP socket and the
     AF_RXRPC socket API.

 (6) To provide crypto offload kernel threads that, when there's slack in
     the system, can see packets that need crypting and provide
     parallelisation in dealing with them.

 (7) To remove the use of system timers.  Since each timer would then send
     a poke to the I/O thread, which would then deal with it when it had
     the opportunity, there seems no point in using system timers if,
     instead, a list of timeouts can be sensibly consulted.  An I/O thread
     only then needs to schedule with a timeout when it is idle.

 (8) To use zero-copy sendmsg to send packets.  This would make use of the
     I/O thread being the sole transmitter on the socket to manage the
     dead-reckoning sequencing of the completion notifications.  There is a
     problem with zero-copy, though: the UDP socket doesn't handle running
     out of option memory very gracefully.

With regard to this first patchset, the changes made include:

 (1) Some fixes, including a fallback for proc_create_net_single_write(),
     setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
     congestion management, which shouldn't be saving the cwnd value
     between calls.

 (2) Improvements in rxrpc tracepoints, including splitting the timer
     tracepoint into a set-timer and a timer-expired trace.

 (3) Addition of a new proc file to display some stats.

 (4) Some code cleanups, including removing some unused bits and
     unnecessary header inclusions.

 (5) A change to the recently added UDP encap_err_rcv hook so that it has
     the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
     point its UDP socket's hook directly at those.

 (6) Definition of a new struct, rxrpc_txbuf, that is used to hold
     transmissible packets of DATA and ACK type in a single 2KiB block
     rather than using an sk_buff.  This allows the buffer to be on a
     number of queues simultaneously more easily, and also guarantees that
     the entire block is in a single unit for zerocopy purposes and that
     the data payload is aligned for in-place crypto purposes.

 (7) ACK txbufs are allocated at proposal and queued for later transmission
     rather than being stored in a single place in the rxrpc_call struct,
     which means only a single ACK can be pending transmission at a time.
     The queue is then drained at various points.  This allows the ACK
     generation code to be simplified.

 (8) The Rx ring buffer is removed.  When a jumbo packet is received (which
     comprises a number of ordinary DATA packets glued together), it used
     to be pointed to by the ring multiple times, with an annotation in a
     side ring indicating which subpacket was in that slot - but this is no
     longer possible.  Instead, the packet is cloned once for each
     subpacket, barring the last, and the range of data is set in the skb
     private area.  This makes it easier for the subpackets in a jumbo
     packet to be decrypted in parallel.

 (9) The Tx ring buffer is removed.  The side annotation ring that held the
     SACK information is also removed.  Instead, in the event of packet
     loss, the SACK data attached an ACK packet is parsed.

(10) Allocate an skcipher request when needed in the rxkad security class
     rather than caching one in the rxrpc_call struct.  This deals with a
     race between externally-driven call disconnection getting rid of the
     skcipher request and sendmsg/recvmsg trying to use it because they
     haven't seen the completion yet.  This is also needed to support
     parallelisation as the skcipher request cannot be used by two or more
     threads simultaneously.

(11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
     through kernel_sendmsg() so that we can provide our own iterator
     (zerocopy explicitly doesn't work with a KVEC iterator).  This also
     lets us avoid the overhead of the security hook.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-09 14:03:49 +00:00
David Howells
42fb06b391 net: Change the udp encap_err_rcv to allow use of {ip,ipv6}_icmp_error()
Change the udp encap_err_rcv signature to match ip_icmp_error() and
ipv6_icmp_error() so that those can be used from the called function and
export them.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
2022-11-08 16:42:28 +00:00
Xin Long
a21b06e731 net: sched: add helper support in act_ct
This patch is to add helper support in act_ct for OVS actions=ct(alg=xxx)
offloading, which is corresponding to Commit cae3a26275 ("openvswitch:
Allow attaching helpers to ct action") in OVS kernel part.

The difference is when adding TC actions family and proto cannot be got
from the filter/match, other than helper name in tb[TCA_CT_HELPER_NAME],
we also need to send the family in tb[TCA_CT_HELPER_FAMILY] and the
proto in tb[TCA_CT_HELPER_PROTO] to kernel.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08 12:15:19 +01:00
Xin Long
f96cba2eb9 net: move add ct helper function to nf_conntrack_helper for ovs and tc
Move ovs_ct_add_helper from openvswitch to nf_conntrack_helper and
rename as nf_ct_add_helper, so that it can be used in TC act_ct in
the next patch.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08 12:15:19 +01:00
Xin Long
ca71277f36 net: move the ct helper function to nf_conntrack_helper for ovs and tc
Move ovs_ct_helper from openvswitch to nf_conntrack_helper and rename
as nf_ct_helper so that it can be used in TC act_ct in the next patch.
Note that it also adds the checks for the family and proto, as in TC
act_ct, the packets with correct family and proto are not guaranteed.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08 12:15:19 +01:00
Jakub Kicinski
b8fd60c36a genetlink: allow families to use split ops directly
Let families to hook in the new split ops.

They are more flexible and should not be much larger than
full ops. Each split op is 40B while full op is 48B.
Devlink for example has 54 dos and 19 dumps, 2 of the dumps
do not have a do -> 56 full commands = 2688B.
Split ops would have taken 2920B, so 9% more space while
allowing individual per/post doit and per-type policies.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-07 12:30:17 +00:00
Jakub Kicinski
20b0b53aca genetlink: introduce split op representation
We currently have two forms of operations - small ops and "full" ops
(or just ops). The former does not have pointers for some of the less
commonly used features (namely dump start/done and policy).

The "full" ops, however, still don't contain all the necessary
information. In particular the policy is per command ID, while
do and dump often accept different attributes. It's also not
possible to define different pre_doit and post_doit callbacks
for different commands within the family.

At the same time a lot of commands do not support dumping and
therefore all the dump-related information is wasted space.

Create a new command representation which can hold info about
a do implementation or a dump implementation, but not both at
the same time.

Use this new representation on the command execution path
(genl_family_rcv_msg) as we either run a do or a dump and
don't have to create a "full" op there.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-07 12:30:16 +00:00
Jakub Kicinski
7c3eaa0222 genetlink: move the private fields in struct genl_family
Move the private fields down to form a "private section".
Use the kdoc "private:" label comment thing to hide them
from the main kdoc comment.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-07 12:30:16 +00:00
Jiri Pirko
dca56c3038 net: expose devlink port over rtnetlink
Expose devlink port handle related to netdev over rtnetlink. Introduce a
new nested IFLA attribute to carry the info. Call into devlink code to
fill-up the nest with existing devlink attributes that are used over
devlink netlink.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03 20:48:37 -07:00
Jiri Pirko
31265c1e29 net: devlink: store copy netdevice ifindex and ifname to allow port_fill() without RTNL held
To avoid a need to take RTNL mutex in port_fill() function, benefit from
the introduce infrastructure that tracks netdevice notifier events.
Store the ifindex and ifname upon register and change name events.
Remove the rtnl_held bool propagated down to port_fill() function as it
is no longer needed.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03 20:48:35 -07:00
Jiri Pirko
c80965784d net: devlink: remove netdev arg from devlink_port_type_eth_set()
Since devlink_port_type_eth_set() should no longer be called by any
driver with netdev pointer as it should rather use
SET_NETDEV_DEVLINK_PORT, remove the netdev arg. Add a warn to
type_clear()

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03 20:48:34 -07:00
Jiri Pirko
3830c5719a net: devlink: convert devlink port type-specific pointers to union
Instead of storing type_dev as a void pointer, convert it to union and
use it to store either struct net_device or struct ib_device pointer.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03 20:48:32 -07:00
Jakub Kicinski
fbeb229a66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03 13:21:54 -07:00
Daniel Machon
6182d5875c net: dcb: add new apptrust attribute
Add new apptrust extension attributes to the 8021Qaz APP managed object.

Two new attributes, DCB_ATTR_DCB_APP_TRUST_TABLE and
DCB_ATTR_DCB_APP_TRUST, has been added. Trusted selectors are passed in
the nested attribute DCB_ATTR_DCB_APP_TRUST, in order of precedence.

The new attributes are meant to allow drivers, whose hw supports the
notion of trust, to be able to set whether a particular app selector is
trusted - and in which order.

Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-03 15:16:50 +01:00
Jiri Slaby (SUSE)
777fa87c76 bonding (gcc13): synchronize bond_{a,t}lb_xmit() types
Both bond_alb_xmit() and bond_tlb_xmit() produce a valid warning with
gcc-13:
  drivers/net/bonding/bond_alb.c:1409:13: error: conflicting types for 'bond_tlb_xmit' due to enum/integer mismatch; have 'netdev_tx_t(struct sk_buff *, struct net_device *)' ...
  include/net/bond_alb.h:160:5: note: previous declaration of 'bond_tlb_xmit' with type 'int(struct sk_buff *, struct net_device *)'

  drivers/net/bonding/bond_alb.c:1523:13: error: conflicting types for 'bond_alb_xmit' due to enum/integer mismatch; have 'netdev_tx_t(struct sk_buff *, struct net_device *)' ...
  include/net/bond_alb.h:159:5: note: previous declaration of 'bond_alb_xmit' with type 'int(struct sk_buff *, struct net_device *)'

I.e. the return type of the declaration is int, while the definitions
spell netdev_tx_t. Synchronize both of them to the latter.

Cc: Martin Liska <mliska@suse.cz>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221031114409.10417-1-jirislaby@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-02 20:38:13 -07:00
Florian Westphal
ecaf75ffd5 netlink: introduce bigendian integer types
Jakub reported that the addition of the "network_byte_order"
member in struct nla_policy increases size of 32bit platforms.

Instead of scraping the bit from elsewhere Johannes suggested
to add explicit NLA_BE types instead, so do this here.

NLA_POLICY_MAX_BE() macro is removed again, there is no need
for it: NLA_POLICY_MAX(NLA_BE.., ..) will do the right thing.

NLA_BE64 can be added later.

Fixes: 08724ef699 ("netlink: introduce NLA_POLICY_MAX_BE")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20221031123407.9158-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-01 21:29:06 -07:00
Eric Dumazet
3bdfb04f13 net: dropreason: add SKB_DROP_REASON_FRAG_TOO_FAR
IPv4 reassembly unit can decide to drop frags based on
/proc/sys/net/ipv4/ipfrag_max_dist sysctl.

Add a specific drop reason to track this specific
and weird case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31 20:14:27 -07:00
Eric Dumazet
77adfd3a1d net: dropreason: add SKB_DROP_REASON_FRAG_REASM_TIMEOUT
Used to track skbs freed after a timeout happened
in a reassmbly unit.

Passing a @reason argument to inet_frag_rbtree_purge()
allows to use correct consumed status for frags
that have been successfully re-assembled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31 20:14:27 -07:00
Eric Dumazet
4ecbb1c27c net: dropreason: add SKB_DROP_REASON_DUP_FRAG
This is used to track when a duplicate segment received by various
reassembly units is dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31 20:14:26 -07:00
Eric Dumazet
0e84afe8eb net: dropreason: add SKB_CONSUMED reason
This will allow to simply use in the future:

	kfree_skb_reason(skb, reason);

Instead of repeating sequences like:

	if (dropped)
	    kfree_skb_reason(skb, reason);
	else
	    consume_skb(skb);

For instance, following patch in the series is adding
@reason to skb_release_data() and skb_release_all(),
so that we can propagate a meaningful @reason whenever
consume_skb()/kfree_skb() have to take care of a potential frag_list.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31 20:14:26 -07:00
Hangbin Liu
f3a63cce1b rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link
This patch use the new helper unregister_netdevice_many_notify() for
rtnl_delete_link(), so that the kernel could reply unicast when userspace
 set NLM_F_ECHO flag to request the new created interface info.

At the same time, the parameters of rtnl_delete_link() need to be updated
since we need nlmsghdr and portid info.

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31 18:10:21 -07:00
Hangbin Liu
1d997f1013 rtnetlink: pass netlink message header and portid to rtnl_configure_link()
This patch pass netlink message header and portid to rtnl_configure_link()
All the functions in this call chain need to add the parameters so we can
use them in the last call rtnl_notify(), and notify the userspace about
the new link info if NLM_F_ECHO flag is set.

- rtnl_configure_link()
  - __dev_notify_flags()
    - rtmsg_ifinfo()
      - rtmsg_ifinfo_event()
        - rtmsg_ifinfo_build_skb()
        - rtmsg_ifinfo_send()
	  - rtnl_notify()

Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub
suggested.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31 18:10:21 -07:00
Jakub Kicinski
8c2a535e08 net: geneve: fix array of flexible structures warnings
New compilers don't like flexible array of flexible structs:

  include/net/geneve.h:62:34: warning: array of flexible structures

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-31 10:43:04 +00:00
Jakub Kicinski
738136a0e3 netlink: split up copies in the ack construction
Clean up the use of unsafe_memcpy() by adding a flexible array
at the end of netlink message header and splitting up the header
and data copies.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-31 09:13:10 +00:00
Pavel Begunkov
fee9ac0664 net: remove SOCK_SUPPORT_ZC from sockmap
sockmap replaces ->sk_prot with its own callbacks, we should remove
SOCK_SUPPORT_ZC as the new proto doesn't support msghdr::ubuf_info.

Cc: <stable@vger.kernel.org> # 6.0
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: e993ffe3da ("net: flag sockets supporting msghdr originated zerocopy")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-28 20:21:25 -07:00
Jakub Kicinski
7354c9024f netlink: hide validation union fields from kdoc
Mark the validation fields as private, users shouldn't set
them directly and they are too complicated to explain in
a more succinct way (there's already a long explanation
in the comment above).

The strict_start_type field is set directly and has a dedicated
comment so move that above the "private" section.

Link: https://lore.kernel.org/r/20221027212107.2639255-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-28 20:19:15 -07:00
Jakub Kicinski
196dd92a00 Kalle Valo says:
====================
pull-request: wireless-next-2022-10-28

First set of patches v6.2. mac80211 refactoring continues for Wi-Fi 7.
All mac80211 driver are now converted to use internal TX queues, this
might cause some regressions so we wanted to do this early in the
cycle.

Note: wireless tree was merged[1] to wireless-next to avoid some
conflicts with mac80211 patches between the trees. Unfortunately there
are still two smaller conflicts in net/mac80211/util.c which Stephen
also reported[2]. In the first conflict initialise scratch_len to
"params->scratch_len ?: 3 * params->len" (note number 3, not 2!) and
in the second conflict take the version which uses elems->scratch_pos.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/commit/?id=dfd2d876b3fda1790bc0239ba4c6967e25d16e91
[2] https://lore.kernel.org/all/20221020032340.5cf101c0@canb.auug.org.au/

mac80211
 - preparation for Wi-Fi 7 Multi-Link Operation (MLO) continues
 - add API to show the link STAs in debugfs
 - all mac80211 drivers are now using mac80211 internal TX queues (iTXQs)

rtw89
 - support 8852BE

rtl8xxxu
 - support RTL8188FU

brmfmac
 - support two station interfaces concurrently

bcma
 - support SPROM rev 11
====================

Link: https://lore.kernel.org/r/20221028132943.304ECC433B5@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-28 18:31:40 -07:00
Mubashir Adnan Qureshi
1a91bb7c3e tcp: add PLB functionality for TCP
Congestion control algorithms track PLB state and cause the connection
to trigger a path change when either of the 2 conditions is satisfied:

- No packets are in flight and (# consecutive congested rounds >=
  sysctl_tcp_plb_idle_rehash_rounds)
- (# consecutive congested rounds >= sysctl_tcp_plb_rehash_rounds)

A round (RTT) is marked as congested when congestion signal
(ECN ce_ratio) over an RTT is greater than sysctl_tcp_plb_cong_thresh.
In the event of RTO, PLB (via tcp_write_timeout()) triggers a path
change and disables congestion-triggered path changes for random time
between (sysctl_tcp_plb_suspend_rto_sec, 2*sysctl_tcp_plb_suspend_rto_sec)
to avoid hopping onto the "connectivity blackhole". RTO-triggered
path changes can still happen during this cool-off period.

Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-28 10:47:42 +01:00
Mubashir Adnan Qureshi
bd456f283b tcp: add sysctls for TCP PLB parameters
PLB (Protective Load Balancing) is a host based mechanism for load
balancing across switch links. It leverages congestion signals(e.g. ECN)
from transport layer to randomly change the path of the connection
experiencing congestion. PLB changes the path of the connection by
changing the outgoing IPv6 flow label for IPv6 connections (implemented
in Linux by calling sk_rethink_txhash()). Because of this implementation
mechanism, PLB can currently only work for IPv6 traffic. For more
information, see the SIGCOMM 2022 paper:
  https://doi.org/10.1145/3544216.3544226

This commit adds new sysctl knobs and sets their default values for
TCP PLB.

Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-28 10:47:42 +01:00
Jakub Kicinski
12dee519d4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

1) Move struct nft_payload_set definition to .c file where it is
   only used.

2) Shrink transport and inner header offset fields in the nft_pktinfo
   structure to 16-bits, from Florian Westphal.

3) Get rid of nft_objref Kbuild toggle, make it built-in into
   nf_tables. This expression is used to instantiate conntrack helpers
   in nftables. After removing the conntrack helper auto-assignment
   toggle it this feature became more important so move it to the nf_tables
   core module. Also from Florian.

4) Extend the existing function to calculate payload inner header offset
   to deal with the GRE and IPIP transport protocols.

6) Add inner expression support for nf_tables. This new expression
   provides a packet parser for tunneled packets which uses a userspace
   description of the expected inner headers. The inner expression
   invokes the payload expression (via direct call) to match on the
   inner header protocol fields using the inner link, network and
   transport header offsets.

   An example of the bytecode generated from userspace to match on
   IP source encapsulated in a VxLAN packet:

   # nft --debug=netlink add rule netdev x y udp dport 4789 vxlan ip saddr 1.2.3.4
     netdev x y
       [ meta load l4proto => reg 1 ]
       [ cmp eq reg 1 0x00000011 ]
       [ payload load 2b @ transport header + 2 => reg 1 ]
       [ cmp eq reg 1 0x0000b512 ]
       [ inner type vxlan hdrsize 8 flags f [ meta load protocol => reg 1 ] ]
       [ cmp eq reg 1 0x00000008 ]
       [ inner type vxlan hdrsize 8 flags f [ payload load 4b @ network header + 12 => reg 1 ] ]
       [ cmp eq reg 1 0x04030201 ]

7) Store inner link, network and transport header offsets in percpu
   area to parse inner packet header once only. Matching on a different
   tunnel type invalidates existing offsets in the percpu area and it
   invokes the inner tunnel parser again.

8) Add support for inner meta matching. This support for
   NFTA_META_PROTOCOL, which specifies the inner ethertype, and
   NFT_META_L4PROTO, which specifies the inner transport protocol.

9) Extend nft_inner to parse GENEVE optional fields to calculate the
   link layer offset.

10) Update inner expression so tunnel offset points to GRE header
    to normalize tunnel header handling. This also allows to perform
    different interpretations of the GRE header from userspace.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nft_inner: set tunnel offset to GRE header offset
  netfilter: nft_inner: add geneve support
  netfilter: nft_meta: add inner match support
  netfilter: nft_inner: add percpu inner context
  netfilter: nft_inner: support for inner tunnel header matching
  netfilter: nft_payload: access ipip payload for inner offset
  netfilter: nft_payload: access GRE payload via inner offset
  netfilter: nft_objref: make it builtin
  netfilter: nf_tables: reduce nft_pktinfo by 8 bytes
  netfilter: nft_payload: move struct nft_payload_set definition where it belongs
====================

Link: https://lore.kernel.org/r/20221026132227.3287-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 20:41:05 -07:00
Jakub Kicinski
31f1aa4f74 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
  2871edb32f ("can: kvaser_usb: Fix possible completions during init_completion")
  abb8670938 ("can: kvaser_usb_leaf: Ignore stale bus-off after start")
  8d21f5927a ("can: kvaser_usb_leaf: Fix improved state not being reported")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 16:56:36 -07:00
Tariq Toukan
28581b9c2c bond: Disable TLS features indication
Bond agnostically interacts with TLS device-offload requests via the
.ndo_sk_get_lower_dev operation. Return value is true iff bond
guarantees fixed mapping between the TLS connection and a lower netdev.

Due to this nature, the bond TLS device offload features are not
explicitly controllable in the bond layer. As of today, these are
read-only values based on the evaluation of bond_sk_check().  However,
this indication might be incorrect and misleading, when the feature bits
are "fixed" by some dependency features.  For example,
NETIF_F_HW_TLS_TX/RX are forcefully cleared in case the corresponding
checksum offload is disabled. But in fact the bond ability to still
offload TLS connections to the lower device is not hurt.

This means that these bits can not be trusted, and hence better become
unused.

This patch revives some old discussion [1] and proposes a much simpler
solution: Clear the bond's TLS features bits. Everyone should stop
reading them.

[1] https://lore.kernel.org/netdev/20210526095747.22446-1-tariqt@nvidia.com/

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221025105300.4718-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 12:45:38 +02:00
David S. Miller
34e0b94520 Merge tag 'ieee802154-for-net-next-2022-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next
Stefan Schmidt says:

====================

==
One of the biggest cycles for ieee802154 in a long time. We are landing the
first pieces of a big enhancements in managing PAN's. We might have another pull
request ready for this cycle later on, but I want to get this one out first.

Miquel Raynal added support for sending frames synchronously as a dependency
to handle MLME commands. Also introducing more filtering levels to match with
the needs of a device when scanning or operating as a pan coordinator.
To support development and testing the hwsim driver for ieee802154 was also
enhanced for the new filtering levels and to update the PIB attributes.

Alexander Aring fixed quite a few bugs spotted during reviewing changes. He
also added support for TRAC in the atusb driver to have better failure
handling if the firmware provides the needed information.

Jilin Yuan fixed a comment with a repeated word in it.
==================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-26 15:24:36 +01:00
Pablo Neira Ayuso
a150d122b6 netfilter: nft_meta: add inner match support
Add support for inner meta matching on:

- NFT_META_PROTOCOL: to match on the ethertype, this can be used
  regardless tunnel protocol provides no link layer header, in that case
  nft_inner sets on the ethertype based on the IP header version field.
- NFT_META_L4PROTO: to match on the layer 4 protocol.

These meta expression are usually autogenerated as dependencies by
userspace nftables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:48:42 +02:00
Pablo Neira Ayuso
0e795b37ba netfilter: nft_inner: add percpu inner context
Add NFT_PKTINFO_INNER_FULL flag to annotate that inner offsets are
available. Store nft_inner_tun_ctx object in percpu area to cache
existing inner offsets for this skbuff.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:48:42 +02:00