Commit Graph

126748 Commits

Author SHA1 Message Date
David S. Miller
08cbabb77e Merge tag 'mlx5-updates-2021-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2021-02-04

Vlad Buslov says:
=================

Implement support for VF tunneling

Abstract

Currently, mlx5 only supports configuration with tunnel endpoint IP address on
uplink representor. Remove implicit and explicit assumptions of tunnel always
being terminated on uplink and implement necessary infrastructure for
configuring tunnels on VF representors and updating rules on such tunnels
according to routing changes.

SW TC model

From TC perspective VF tunnel configuration requires two rules in both
directions:

TX rules

1. Rule that redirects packets from UL to VF rep that has the tunnel
endpoint IP address:

$ tc -s filter show dev enp8s0f0 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac 16:c9:a0:2d:69:2c
  src_mac 0c:42:a1:58:ab:e4
  eth_type ipv4
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
        index 3 ref 1 bind 1 installed 377 sec used 0 sec
        Action statistics:
        Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 114096 bytes 952 pkt
        backlog 0b 0p requeues 0
        cookie 878fa48d8c423fc08c3b6ca599b50a97
        no_percpu
        used_hw_stats delayed

2. Rule that decapsulates the tunneled flow and redirects to destination VF
representor:

$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac ca:2e:a7:3f:f5:0f
  src_mac 0a:40:bd:30:89:99
  eth_type ipv4
  enc_dst_ip 7.7.7.5
  enc_src_ip 7.7.7.1
  enc_key_id 98
  enc_dst_port 4789
  enc_tos 0
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: tunnel_key  unset pipe
         index 2 ref 1 bind 1 installed 434 sec used 434 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        used_hw_stats delayed

        action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
        index 4 ref 1 bind 1 installed 434 sec used 0 sec
        Action statistics:
        Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 129936 bytes 1082 pkt
        backlog 0b 0p requeues 0
        cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
        no_percpu
        used_hw_stats delayed

RX rules

1. Rule that encapsulates the tunneled flow and redirects packets from
source VF rep to tunnel device:

$ tc -s filter show dev enp8s0f0_1 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac 0a:40:bd:30:89:99
  src_mac ca:2e:a7:3f:f5:0f
  eth_type ipv4
  ip_tos 0/0x3
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: tunnel_key  set
        src_ip 7.7.7.5
        dst_ip 7.7.7.1
        key_id 98
        dst_port 4789
        nocsum
        ttl 64 pipe
         index 1 ref 1 bind 1 installed 411 sec used 411 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        no_percpu
        used_hw_stats delayed

        action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen
        index 1 ref 1 bind 1 installed 411 sec used 0 sec
        Action statistics:
        Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 5615833 bytes 4028 pkt
        backlog 0b 0p requeues 0
        cookie bb406d45d343bf7ade9690ae80c7cba4
        no_percpu
        used_hw_stats delayed

2. Rule that redirects from tunnel device to UL rep:

$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac ca:2e:a7:3f:f5:0f
  src_mac 0a:40:bd:30:89:99
  eth_type ipv4
  enc_dst_ip 7.7.7.5
  enc_src_ip 7.7.7.1
  enc_key_id 98
  enc_dst_port 4789
  enc_tos 0
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: tunnel_key  unset pipe
         index 2 ref 1 bind 1 installed 434 sec used 434 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        used_hw_stats delayed

        action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
        index 4 ref 1 bind 1 installed 434 sec used 0 sec
        Action statistics:
        Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 129936 bytes 1082 pkt
        backlog 0b 0p requeues 0
        cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
        no_percpu
        used_hw_stats delayed

HW offloads model

For hardware offload the goal is to mach packet on both rules without exposing
it to software on tunnel endpoint VF. In order to achieve this for tx, TC
implementation marks encap rules with tunnel endpoint on mlx5 VF of same eswitch
with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds header modification
rule to overwrite packet source port to the value of tunnel VF. Eswitch code is
modified to recirculate such packets after source port value is changed, which
allows second tx rules to match.

For rx path indirect table infrastructure is used to allow fully processing VF
tunnel traffic in hardware. To implement such pipeline driver needs to program
the hardware after matching on UL rule to overwrite source vport from UL to
tunnel VF and recirculate the packet to the root table to allow matching on the
rule installed on tunnel VF. For this, indirect table matches all encapsulated
traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by
the miss rule. Such configuration will cause packet to appear on VF representor
instead of VF itself if packet has been matches by indirect table rule based on
tunnel parameters but missed on second rule (after recirculation). Handle such
case by marking packets processed by indirect table with special 0xFFF value in
reg_c1 and extending slow table with additional flow group that matches on
reg_c0 (source port value set by indirect tables) and reg_c1 (special 0xFFF
mark). When creating offloads fdb tables, install one rule per VF vport to match
on recirculated miss packets and redirect them to appropriate VF vport.

Routing events

In order to support routing changes and migration of tunnel device between
different endpoint VFs, implement routing infrastructure and update it with FIB
events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel
rule is attached to a routing entry, which is shared for rules of same tunnel.
On FIB event the work is scheduled to delete/recreate all rules of affected
tunnel.

Note: only vxlan tunnel type is supported by this series.

=================
2021-02-08 16:10:16 -08:00
Jakub Kicinski
8e57158683 Merge tag 'batadv-next-pullrequest-20210208' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:

====================
This feature/cleanup patchset is an updated version of the pull request
of Feb 2nd (batadv-next-pullrequest-20210202) and includes the
following patches:

 - Bump version strings, by Simon Wunderlich (added commit log)

 - Drop publication years from copyright info, by Sven Eckelmann
   (replaced the previous patch which updated copyright years, as per
    our discussion)

 - Avoid sizeof on flexible structure, by Sven Eckelmann (unchanged)

 - Fix names for kernel-doc blocks, by Sven Eckelmann (unchanged)

* tag 'batadv-next-pullrequest-20210208' of git://git.open-mesh.org/linux-merge:
  batman-adv: Fix names for kernel-doc blocks
  batman-adv: Avoid sizeof on flexible structure
  batman-adv: Drop publication years from copyright info
  batman-adv: Start new development cycle
====================

Link: https://lore.kernel.org/r/20210208165938.13262-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-08 11:32:40 -08:00
Jakub Kicinski
c273a20c30 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Remove indirection and use nf_ct_get() instead from nfnetlink_log
   and nfnetlink_queue, from Florian Westphal.

2) Add weighted random twos choice least-connection scheduling for IPVS,
   from Darby Payne.

3) Add a __hash placeholder in the flow tuple structure to identify
   the field to be included in the rhashtable key hash calculation.

4) Add a new nft_parse_register_load() and nft_parse_register_store()
   to consolidate register load and store in the core.

5) Statify nft_parse_register() since it has no more module clients.

6) Remove redundant assignment in nft_cmp, from Colin Ian King.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
  netfilter: nftables: remove redundant assignment of variable err
  netfilter: nftables: statify nft_parse_register()
  netfilter: nftables: add nft_parse_register_store() and use it
  netfilter: nftables: add nft_parse_register_load() and use it
  netfilter: flowtable: add hash offset field to tuple
  ipvs: add weighted random twos choice algorithm
  netfilter: ctnetlink: remove get_ct indirection
====================

Link: https://lore.kernel.org/r/20210206015005.23037-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06 15:34:23 -08:00
Vladimir Oltean
8fe6832e96 net: dsa: felix: propagate the LAG offload ops towards the ocelot lib
The ocelot switch has been supporting LAG offload since its initial
commit, however felix could not make use of that, due to lack of a LAG
abstraction in DSA. Now that we have that, let's forward DSA's calls
towards the ocelot library, who will deal with setting up the bonding.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06 14:51:51 -08:00
Vladimir Oltean
23ca3b727e net: mscc: ocelot: rebalance LAGs on link up/down events
At present there is an issue when ocelot is offloading a bonding
interface, but one of the links of the physical ports goes down. Traffic
keeps being hashed towards that destination, and of course gets dropped
on egress.

Monitor the netdev notifier events emitted by the bonding driver for
changes in the physical state of lower interfaces, to determine which
ports are active and which ones are no longer.

Then extend ocelot_get_bond_mask to return either the configured bonding
interfaces, or the active ones, depending on a boolean argument. The
code that does rebalancing only needs to do so among the active ports,
whereas the bridge forwarding mask and the logical port IDs still need
to look at the permanently bonded ports.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06 14:51:51 -08:00
Vladimir Oltean
528d3f190c net: mscc: ocelot: drop the use of the "lags" array
We can now simplify the implementation by always using ocelot_get_bond_mask
to look up the other ports that are offloading the same bonding interface
as us.

In ocelot_set_aggr_pgids, the code had a way to uniquely iterate through
LAGs. We need to achieve the same behavior by marking each LAG as visited,
which we do now by using a temporary 32-bit "visited" bitmask. This is
ok and we do not need dynamic memory allocation, because we know that
this switch architecture will not have more than 32 ports (the PGID port
masks are 32-bit anyway).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06 14:51:50 -08:00
Vladimir Oltean
b80af65969 net: mscc: ocelot: set up the bonding mask in a way that avoids a net_device
Since this code should be called from pure switchdev as well as from
DSA, we must find a way to determine the bonding mask not by looking
directly at the net_device lowers of the bonding interface, since those
could have different private structures.

We keep a pointer to the bonding upper interface, if present, in struct
ocelot_port. Then the bonding mask becomes the bitwise OR of all ports
that have the same bonding upper interface. This adds a duplication of
functionality with the current "lags" array, but the duplication will be
short-lived, since further patches will remove the latter completely.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06 14:51:50 -08:00
Kevin Hao
3f6e687dff net: Introduce {netdev,napi}_alloc_frag_align()
In the current implementation of {netdev,napi}_alloc_frag(), it doesn't
have any align guarantee for the returned buffer address, But for some
hardwares they do require the DMA buffer to be aligned correctly,
so we would have to use some workarounds like below if the buffers
allocated by the {netdev,napi}_alloc_frag() are used by these hardwares
for DMA.
    buf = napi_alloc_frag(really_needed_size + align);
    buf = PTR_ALIGN(buf, align);

These codes seems ugly and would waste a lot of memories if the buffers
are used in a network driver for the TX/RX. We have added the align
support for the page_frag functions, so add the corresponding
{netdev,napi}_frag functions.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06 11:57:28 -08:00
Kevin Hao
b358e2122b mm: page_frag: Introduce page_frag_alloc_align()
In the current implementation of page_frag_alloc(), it doesn't have
any align guarantee for the returned buffer address. But for some
hardwares they do require the DMA buffer to be aligned correctly,
so we would have to use some workarounds like below if the buffers
allocated by the page_frag_alloc() are used by these hardwares for
DMA.
    buf = page_frag_alloc(really_needed_size + align);
    buf = PTR_ALIGN(buf, align);

These codes seems ugly and would waste a lot of memories if the buffers
are used in a network driver for the TX/RX. So introduce
page_frag_alloc_align() to make sure that an aligned buffer address is
returned.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06 11:57:28 -08:00
Sven Eckelmann
cfa55c6d47 batman-adv: Drop publication years from copyright info
The batman-adv source code was using the year of publication (to net-next)
as "last" year for the copyright statement. The whole source code mentioned
in the MAINTAINERS "BATMAN ADVANCED" section was handled as a single entity
regarding the publishing year.

This avoided having outdated (in sense of year information - not copyright
holder) publishing information inside several files. But since the simple
"update copyright year" commit (without other changes) in the file was not
well received in the upstream kernel, the option to not have a copyright
year (for initial and last publication) in the files are chosen instead.
More detailed information about the years can still be retrieved from the
SCM system.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2021-02-06 09:22:10 +01:00
Vlad Buslov
8e404fefa5 net/mlx5e: Match recirculated packet miss in slow table using reg_c1
Previous patch in series that implements stack devices RX path implements
indirect table rules that match on tunnel VNI. After such rule is created
all tunnel traffic is recirculated to root table. However, recirculated
packet might not match on any rules installed in the table (for example,
when IP traffic follows ARP traffic). In that case packets appear on
representor of tunnel endpoint VF instead being redirected to the VF
itself.

Extend slow table with additional flow group that matches on reg_c0 (source
port value set by indirect tables implemented by previous patch in series)
and reg_c1 (special 0xFFF mark). When creating offloads fdb tables, install
one rule per VF vport to match on recirculated miss packets and redirect
them to appropriate VF vport. Modify indirect tables code to also rewrite
reg_c1 with special 0xFFF mark.

Implementation reuses reg_c1 tunnel id bits. This is safe to do because
recirculated packets are always matched before decapsulation.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-05 20:53:37 -08:00
Vlad Buslov
48d216e559 net/mlx5e: Refactor reg_c1 usage
Following patch in series uses reg_c1 in eswitch code. To use reg_c1
helpers in both TC and eswitch code, refactor existing helpers according to
similar use case of reg_c0 and move the functionality into eswitch.h.
Calculate reg mappings length from new defines to ensure that they are
always in sync and only need to be changed in single place.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-05 20:53:37 -08:00
Vlad Buslov
10742efc20 net/mlx5e: VF tunnel TX traffic offloading
When tunnel endpoint is on VF, driver still assumes that endpoint is on
uplink and incorrectly configures encap rule offload according to that
assumption. As a result, traffic is sent directly to the uplink and rules
installed on representor of tunnel endpoint VF are ignored.

Implement following changes to allow offloading tx traffic with tunnel
endpoint on VF:

- For tunneling flows perform route lookup on route and out devices pair.
If out device is uplink and route device is VF of same physical port, then
modify packet reg_c_0 metadata register (source port) with the value of VF
vport. Use eswitch vhca_id->vport mapping introduced in one of previous
patches in the series to obtain vport from route netdevice.

- Recirculate encapsulated packets to VF vport in order to apply any flow
rules installed on VF representor that match on encapsulated traffic.

Only enable support for this functionality when all following conditions
are true:

- Hardware advertises capability to preserve reg_c_0 value on packet
recirculation.

- Vport metadata matching is enabled.

- Termination tables are to be used by the flow.

Example TC rules for VF tunnel traffic:

1. Rule that redirects packets from UL to VF rep that has the tunnel
endpoint IP address:

$ tc -s filter show dev enp8s0f0 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac 16:c9:a0:2d:69:2c
  src_mac 0c:42:a1:58:ab:e4
  eth_type ipv4
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
        index 3 ref 1 bind 1 installed 377 sec used 0 sec
        Action statistics:
        Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 114096 bytes 952 pkt
        backlog 0b 0p requeues 0
        cookie 878fa48d8c423fc08c3b6ca599b50a97
        no_percpu
        used_hw_stats delayed

2. Rule that decapsulates the tunneled flow and redirects to destination VF
representor:

$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac ca:2e:a7:3f:f5:0f
  src_mac 0a:40:bd:30:89:99
  eth_type ipv4
  enc_dst_ip 7.7.7.5
  enc_src_ip 7.7.7.1
  enc_key_id 98
  enc_dst_port 4789
  enc_tos 0
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: tunnel_key  unset pipe
         index 2 ref 1 bind 1 installed 434 sec used 434 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        used_hw_stats delayed

        action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
        index 4 ref 1 bind 1 installed 434 sec used 0 sec
        Action statistics:
        Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 129936 bytes 1082 pkt
        backlog 0b 0p requeues 0
        cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
        no_percpu
        used_hw_stats delayed

Co-developed-by: Dmytro Linkin <dlinkin@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-05 20:53:35 -08:00
Leon Romanovsky
edf597da02 netfilter: move handlers to net/ip_vs.h
Fix the following compilation warnings:
net/netfilter/ipvs/ip_vs_proto_tcp.c:147:1: warning: no previous prototype for 'tcp_snat_handler' [-Wmissing-prototypes]
  147 | tcp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp,
      | ^~~~~~~~~~~~~~~~
net/netfilter/ipvs/ip_vs_proto_udp.c:136:1: warning: no previous prototype for 'udp_snat_handler' [-Wmissing-prototypes]
  136 | udp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp,
      | ^~~~~~~~~~~~~~~~

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:37:57 -08:00
Leon Romanovsky
04f00ab227 net/core: move gro function declarations to separate header
Fir the following compilation warnings:
 1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)

net/ipv6/ip6_offload.c:182:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
  182 | INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
      |                                         ^~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:320:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
  320 | INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
      |                             ^~~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:182:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
  182 | INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
      |                                         ^~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:320:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
  320 | INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:37:57 -08:00
Leon Romanovsky
f9a4719cc1 ipv6: move udp declarations to net/udp.h
Fix the following compilation warning:

net/ipv6/udp.c:1031:30: warning: no previous prototype for 'udp_v6_early_demux' [-Wmissing-prototypes]
 1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
      |                              ^~~~~~~~~~~~~~~~~~
net/ipv6/udp.c:1072:29: warning: no previous prototype for 'udpv6_rcv' [-Wmissing-prototypes]
 1072 | INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
      |                             ^~~~~~~~~

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:37:57 -08:00
Leon Romanovsky
1faba27f11 ipv6: silence compilation warning for non-IPV6 builds
The W=1 compilation of allmodconfig generates the following warning:

net/ipv6/icmp.c:448:6: warning: no previous prototype for 'icmp6_send' [-Wmissing-prototypes]
  448 | void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
      |      ^~~~~~~~~~

Fix it by providing function declaration for builds with ipv6 as a module.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:37:56 -08:00
Xin Long
a4a600dd30 udp: call udp_encap_enable for v6 sockets when enabling encap
When enabling encap for a ipv6 socket without udp_encap_needed_key
increased, UDP GRO won't work for v4 mapped v6 address packets as
sk will be NULL in udp4_gro_receive().

This patch is to enable it by increasing udp_encap_needed_key for
v6 sockets in udp_tunnel_encap_enable(), and correspondingly
decrease udp_encap_needed_key in udpv6_destroy_sock().

v1->v2:
  - add udp_encap_disable() and export it.
v2->v3:
  - add the change for rxrpc and bareudp into one patch, as Alex
    suggested.
v3->v4:
  - move rxrpc part to another patch.

Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:37:14 -08:00
Alexander Lobakin
bc38f30f8d net: introduce common dev_page_is_reusable()
A bunch of drivers test the page before reusing/recycling for two
common conditions:
 - if a page was allocated under memory pressure (pfmemalloc page);
 - if a page was allocated at a distant memory node (to exclude
   slowdowns).

Introduce a new common inline for doing this, with likely() already
folded inside to make driver code a bit simpler.

Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:20:14 -08:00
Alexander Lobakin
48f971c9c8 skbuff: constify skb_propagate_pfmemalloc() "page" argument
The function doesn't write anything to the page struct itself,
so this argument can be const.

Misc: align second argument to the brace while at it.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:20:13 -08:00
Alexander Lobakin
1d7bab6a94 mm: constify page_is_pfmemalloc() argument
The function only tests for page->index, so its argument should be
const.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:20:13 -08:00
Brian Vazquez
0053859496 net: add EXPORT_INDIRECT_CALLABLE wrapper
When a static function is annotated with INDIRECT_CALLABLE_SCOPE and
CONFIG_RETPOLINE is set, the static keyword is removed. Sometimes the
function needs to be exported but EXPORT_SYMBOL can't be used because if
CONFIG_RETPOLINE is not set, we will attempt to export a static symbol.

This patch introduces a new indirect call wrapper:
EXPORT_INDIRECT_CALLABLE. This basically does EXPORT_SYMBOL when
CONFIG_RETPOLINE is set, but does nothing when it's not.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Link: https://lore.kernel.org/r/20210204181839.558951-1-brianvv@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:06:26 -08:00
Marcelo Ricardo Leitner
7e3ce05e7f netlink: add tracepoint at NL_SET_ERR_MSG
Often userspace won't request the extack information, or they don't log it
because of log level or so, and even when they do, sometimes it's not
enough to know exactly what caused the error.

Netlink extack is the standard way of reporting erros with descriptive
error messages. With a trace point on it, we then can know exactly where
the error happened, regardless of userspace app. Also, we can even see if
the err msg was overwritten.

The wrapper do_trace_netlink_extack() is because trace points shouldn't be
called from .h files, as trace points are not that small, and the function
call to do_trace_netlink_extack() on the macros is not protected by
tracepoint_enabled() because the macros are called from modules, and this
would require exporting some trace structs. As this is error path, it's
better to export just the wrapper instead.

v2: removed leftover tracepoint declaration

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/4546b63e67b2989789d146498b13cc09e1fdc543.1612403190.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:05:59 -08:00
Jonas Bonn
49ecc587dc Revert "GTP: add support for flow based tunneling API"
This reverts commit 9ab7e76aef.

This patch was committed without maintainer approval and despite a number
of unaddressed concerns from review.  There are several issues that
impede the acceptance of this patch and that make a reversion of this
particular instance of these changes the best way forward:

i)  the patch contains several logically separate changes that would be
better served as smaller patches (for review purposes)
ii) functionality like the handling of end markers has been introduced
without further explanation
iii) symmetry between the handling of GTPv0 and GTPv1 has been
unnecessarily broken
iv) the patchset produces 'broken' packets when extension headers are
included
v) there are no available userspace tools to allow for testing this
functionality
vi) there is an unaddressed Coverity report against the patch concering
memory leakage
vii) most importantly, the patch contains a large amount of superfluous
churn that impedes other ongoing work with this driver

This patch will be reworked into a series that aligns with other
ongoing work and facilitates review.

Signed-off-by: Jonas Bonn <jonas@norrbonn.se>
Acked-by: Harald Welte <laforge@gnumonks.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 09:29:57 -08:00
Hariharan Ananthakrishnan
3dd344ea84 net: tracepoint: exposing sk_family in all tcp:tracepoints
Similar to sock:inet_sock_set_state tracepoint, expose sk_family to
distinguish AF_INET and AF_INET6 families.

The following tcp tracepoints are updated:
tcp:tcp_destroy_sock
tcp:tcp_rcv_space_adjust
tcp:tcp_retransmit_skb
tcp:tcp_send_reset
tcp:tcp_receive_reset
tcp:tcp_retransmit_synack
tcp:tcp_probe

Signed-off-by: Hariharan Ananthakrishnan <hari@netflix.com>
Signed-off-by: Brendan Gregg <bgregg@netflix.com>
Link: https://lore.kernel.org/r/20210129001210.344438-1-hari@netflix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 09:25:36 -08:00
Wei Wang
f5a5589c72 tcp: use a smaller percpu_counter batch size for sk_alloc
Currently, a percpu_counter with the default batch size (2*nr_cpus) is
used to record the total # of active sockets per protocol. This means
sk_sockets_allocated_read_positive() could be off by +/-2*(nr_cpus^2).
This under/over-estimation could lead to wrong memory suppression
conditions in __sk_raise_mem_allocated().
Fix this by using a more reasonable fixed batch size of 16.

See related commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting") that addresses a similar issue.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20210202193408.1171634-1-weiwan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 18:43:17 -08:00
Danielle Ratson
c8907043c6 ethtool: Get link mode in use instead of speed and duplex parameters
Currently, when user space queries the link's parameters, as speed and
duplex, each parameter is passed from the driver to ethtool.

Instead, get the link mode bit in use, and derive each of the parameters
from it in ethtool.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 18:37:29 -08:00
Danielle Ratson
012ce4dd31 ethtool: Extend link modes settings uAPI with lanes
Currently, when auto negotiation is on, the user can advertise all the
linkmodes which correspond to a specific speed, but does not have a
similar selector for the number of lanes. This is significant when a
specific speed can be achieved using different number of lanes.  For
example, 2x50 or 4x25.

Add 'ETHTOOL_A_LINKMODES_LANES' attribute and expand 'struct
ethtool_link_settings' with lanes field in order to implement a new
lanes-selector that will enable the user to advertise a specific number
of lanes as well.

When auto negotiation is off, lanes parameter can be forced only if the
driver supports it. Add a capability bit in 'struct ethtool_ops' that
allows ethtool know if the driver can handle the lanes parameter when
auto negotiation is off, so if it does not, an error message will be
returned when trying to set lanes.

Example:

$ ethtool -s swp1 lanes 4
$ ethtool swp1
  Settings for swp1:
	Supported ports: [ FIBRE ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                40000baseCR4/Full
				40000baseSR4/Full
				40000baseLR4/Full
                                25000baseCR/Full
                                25000baseSR/Full
				50000baseCR2/Full
                                100000baseSR4/Full
				100000baseCR4/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  40000baseCR4/Full
				40000baseSR4/Full
				40000baseLR4/Full
                                100000baseSR4/Full
				100000baseCR4/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Unknown! (255)
        Auto-negotiation: on
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Link detected: no

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 18:37:28 -08:00
Brian Vazquez
bbd807dfbf net: indirect call helpers for ipv4/ipv6 dst_check functions
This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 14:51:40 -08:00
Brian Vazquez
f67fbeaebd net: use indirect call helpers for dst_mtu
This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 14:51:39 -08:00
Brian Vazquez
6585d7dc49 net: use indirect call helpers for dst_output
This patch avoids the indirect call for the common case:
ip6_output and ip_output

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 14:51:39 -08:00
Brian Vazquez
e43b219064 net: use indirect call helpers for dst_input
This patch avoids the indirect call for the common case:
ip_local_deliver and ip6_input

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 14:51:39 -08:00
Emil Renner Berthing
4f4e54366e net: usb: cdc_ncm: use new API for bh tasklet
This converts the driver to use the new tasklet API introduced in
commit 12cc923f1c ("tasklet: Introduce new initialization API")

It is unfortunate that we need to add a pointer to the driver context to
get back to the usbnet device, but the space will be reclaimed once
there are no more users of the old API left and we can remove the data
value and flag from the tasklet struct.

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210130234637.26505-1-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 14:22:21 -08:00
Jakub Kicinski
0256317a61 Merge tag 'mac80211-next-for-net-next-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:

====================
This time, only RTNL locking reduction fallout.
 - cfg80211_dev_rename() requires RTNL
 - cfg80211_change_iface() and cfg80211_set_encryption()
   require wiphy mutex (was missing in wireless extensions)
 - cfg80211_destroy_ifaces() requires wiphy mutex
 - netdev registration can fail due to notifiers, and then
   notifiers are "unrolled", need to handle this properly

* tag 'mac80211-next-for-net-next-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next:
  cfg80211: fix netdev registration deadlock
  cfg80211: call cfg80211_destroy_ifaces() with wiphy lock held
  wext: call cfg80211_set_encryption() with wiphy lock held
  wext: call cfg80211_change_iface() with wiphy lock held
  nl80211: call cfg80211_dev_rename() under RTNL
====================

Link: https://lore.kernel.org/r/20210202144106.38207-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 18:40:42 -08:00
Amit Cohen
907eea4868 net: ipv6: Emit notification when fib hardware flags are changed
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.

The asynchronous nature of route installation in hardware can lead
to a routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.

Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.

Introduce a sysctl that controls this behavior.

Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notification per-route might confuse existing
  routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.

Move fib6_info_hw_flags_set() to C file because it is no longer a short
function.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 17:45:59 -08:00
Amit Cohen
fbaca8f895 net: Pass 'net' struct as first argument to fib6_info_hw_flags_set()
The next patch will emit notification when hardware flags are changed,
in case that fib_notify_on_flag_change sysctl is set to 1.

To know sysctl values, net struct is needed.
This change is consistent with the IPv4 version, which gets 'net' struct
as its first argument.

Currently, the only callers of this function are mlxsw and netdevsim.
Patch the callers to pass net.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 17:45:59 -08:00
Amit Cohen
680aea08e7 net: ipv4: Emit notification when fib hardware flags are changed
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.

The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.

Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.

Introduce a sysctl that controls this behavior.

Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notification per-route might confuse existing
  routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Acked-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 17:45:59 -08:00
Jakub Kicinski
d1e1355aef Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 14:21:31 -08:00
Linus Torvalds
a992562872 Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
  trees.

  Current release - regressions:

   - ip_tunnel: fix mtu calculation

   - mlx5: fix function calculation for page trees

  Previous releases - regressions:

   - vsock: fix the race conditions in multi-transport support

   - neighbour: prevent a dead entry from updating gc_list

   - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add

  Previous releases - always broken:

   - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
     cgroup getsockopt infra when user space is trying to race against
     optlen, from Loris Reiff.

   - bpf: add missing fput() in BPF inode storage map update helper

   - udp: ipv4: manipulate network header of NATed UDP GRO fraglist

   - mac80211: fix station rate table updates on assoc

   - r8169: work around RTL8125 UDP HW bug

   - igc: report speed and duplex as unknown when device is runtime
     suspended

   - rxrpc: fix deadlock around release of dst cached on udp tunnel"

* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
  net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
  net: ipa: fix two format specifier errors
  net: ipa: use the right accessor in ipa_endpoint_status_skip()
  net: ipa: be explicit about endianness
  net: ipa: add a missing __iomem attribute
  net: ipa: pass correct dma_handle to dma_free_coherent()
  r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
  net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
  net: mvpp2: TCAM entry enable should be written after SRAM data
  net: lapb: Copy the skb before sending a packet
  net/mlx5e: Release skb in case of failure in tc update skb
  net/mlx5e: Update max_opened_tc also when channels are closed
  net/mlx5: Fix leak upon failure of rule creation
  net/mlx5: Fix function calculation for page trees
  docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
  udp: ipv4: manipulate network header of NATed UDP GRO fraglist
  net: ip_tunnel: fix mtu calculation
  vsock: fix the race conditions in multi-transport support
  net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
  ibmvnic: device remove has higher precedence over reset
  ...
2021-02-02 10:26:09 -08:00
Dongseok Yi
c3df39ac9b udp: ipv4: manipulate network header of NATed UDP GRO fraglist
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.

A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.

Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.

Fixes: 9fd1ff5d2a (udp: Support UDP fraglist GRO/GSO.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/1611962007-80092-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01 20:02:16 -08:00
Alexander Ovechkin
938e0fcd32 net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
Commit e5f0e8f8e4 ("net: sched: introduce and use qdisc tree flush/purge helpers")
introduced qdisc tree flush/purge helpers, but erroneously used flush helper
instead of purge helper in qdisc_replace function.
This issue was found in our CI, that tests various qdisc setups by configuring
qdisc and sending data through it. Call of invalid helper sporadically leads
to corruption of vt_tree/cf_tree of hfsc_class that causes kernel oops:

 Oops: 0000 [#1] SMP PTI
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.11.0-8f6859df #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 RIP: 0010:rb_insert_color+0x18/0x190
 Code: c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 07 48 85 c0 0f 84 05 01 00 00 48 8b 10 f6 c2 01 0f 85 34 01 00 00 <48> 8b 4a 08 49 89 d0 48 39 c1 74 7d 48 85 c9 74 32 f6 01 01 75 2d
 RSP: 0018:ffffc900000b8bb0 EFLAGS: 00010246
 RAX: ffff8881ef4c38b0 RBX: ffff8881d956e400 RCX: ffff8881ef4c38b0
 RDX: 0000000000000000 RSI: ffff8881d956f0a8 RDI: ffff8881d956e4b0
 RBP: 0000000000000000 R08: 000000d5c4e249da R09: 1600000000000000
 R10: ffffc900000b8be0 R11: ffffc900000b8b28 R12: 0000000000000001
 R13: 000000000000005a R14: ffff8881f0905000 R15: ffff8881f0387d00
 FS:  0000000000000000(0000) GS:ffff8881f8b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 00000001f4796004 CR4: 0000000000060ee0
 Call Trace:
  <IRQ>
  init_vf.isra.19+0xec/0x250 [sch_hfsc]
  hfsc_enqueue+0x245/0x300 [sch_hfsc]
  ? fib_rules_lookup+0x12a/0x1d0
  ? __dev_queue_xmit+0x4b6/0x930
  ? hfsc_delete_class+0x250/0x250 [sch_hfsc]
  __dev_queue_xmit+0x4b6/0x930
  ? ip6_finish_output2+0x24d/0x590
  ip6_finish_output2+0x24d/0x590
  ? ip6_output+0x6c/0x130
  ip6_output+0x6c/0x130
  ? __ip6_finish_output+0x110/0x110
  mld_sendpack+0x224/0x230
  mld_ifc_timer_expire+0x186/0x2c0
  ? igmp6_group_dropped+0x200/0x200
  call_timer_fn+0x2d/0x150
  run_timer_softirq+0x20c/0x480
  ? tick_sched_do_timer+0x60/0x60
  ? tick_sched_timer+0x37/0x70
  __do_softirq+0xf7/0x2cb
  irq_exit+0xa0/0xb0
  smp_apic_timer_interrupt+0x74/0x150
  apic_timer_interrupt+0xf/0x20
  </IRQ>

Fixes: e5f0e8f8e4 ("net: sched: introduce and use qdisc tree flush/purge helpers")
Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
Reported-by: Alexander Kuznetsov <wwfq@yandex-team.ru>
Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/20210201200049.299153-1-ovov@yandex-team.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01 18:40:18 -08:00
Linus Torvalds
88bb507a74 Merge tag 'media/v5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
 "The rockship rkisp1 driver will be promoted from staging in 5.11.

  While not too late, do a few uAPI changes which are needed to better
  support its functionalities"

* tag 'media/v5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: rockchip: rkisp1: extend uapi array sizes
  media: rockchip: rkisp1: carry ip version information
  media: rockchip: rkisp1: reduce number of histogram grid elements in uapi
  media: rkisp1: stats: mask the hist_bins values
  media: rkisp1: stats: remove a wrong cast to u8
  media: rkisp1: uapi: change hist_bins array type from __u16 to __u32
2021-02-01 11:15:57 -08:00
Johannes Berg
40c575d1ec cfg80211: fix netdev registration deadlock
If register_netdevice() fails after having called cfg80211's
netdev notifier (cfg80211_netdev_notifier_call) it will call
the notifier again with UNREGISTER. This would then lock the
wiphy mutex because we're marked as registered, which causes
a deadlock.

Fix this by separately keeping track of whether or not we're
in the middle of registering to also skip the notifier call
on this unregister.

Reported-by: syzbot+2ae0ca9d7737ad1a62b7@syzkaller.appspotmail.com
Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Link: https://lore.kernel.org/r/20210201192048.ed8bad436737.I7cae042c44b15f80919a285799a15df467e9d42d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-02-01 19:30:54 +01:00
Linus Torvalds
f5a376edde Merge tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
 "A single fix for objtool to generate proper unwind info for newer
  toolchains which do not generate section symbols anymore. And a
  cleanup ontop.

  This was originally going to go during the next merge window but
  people can already trigger a build error with binutils-2.36 which
  doesn't emit section symbols - something which objtool relies on - so
  let's expedite it"

* tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument
  x86/entry: Emit a symbol for register restoring thunk
2021-01-31 11:48:12 -08:00
Linus Torvalds
c178fae3a9 Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:

 - SUNRPC: Handle 0 length opaque XDR object data properly

 - Fix a layout segment leak in pnfs_layout_process()

 - pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn

 - pNFS/NFSv4: Improve rejection of out-of-order layouts

 - pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()

* tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Handle 0 length opaque XDR object data properly
  SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
  pNFS/NFSv4: Improve rejection of out-of-order layouts
  pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
  pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
  pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
2021-01-31 11:19:12 -08:00
Neal Cardwell
14e8e0f600 tcp: shrink inet_connection_sock icsk_mtup enabled and probe_size
This commit shrinks inet_connection_sock by 4 bytes, by shrinking
icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
icsk_mtup.probe_size from s32 to an unsuigned 31 bit field.

This is to save space to compensate for the recent introduction of a
new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent bug
fix commit 9d9b1ee0b2 ("tcp: fix TCP_USER_TIMEOUT with zero window").

This should not change functionality, since icsk_mtup.enabled is only
ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0
or a positive MTU value returned by tcp_mss_to_mtu()

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210129185438.1813237-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 22:48:30 -08:00
Vladimir Oltean
e21268efbe net: dsa: felix: perform switch setup for tag_8021q
Unlike sja1105, the only other user of the software-defined tag_8021q.c
tagger format, the implementation we choose for the Felix DSA switch
driver preserves full functionality under a vlan_filtering bridge
(i.e. IP termination works through the DSA user ports under all
circumstances).

The tag_8021q protocol just wants:
- Identifying the ingress switch port based on the RX VLAN ID, as seen
  by the CPU. We achieve this by using the TCAM engines (which are also
  used for tc-flower offload) to push the RX VLAN as a second, outer
  tag, on egress towards the CPU port.
- Steering traffic injected into the switch from the network stack
  towards the correct front port based on the TX VLAN, and consuming
  (popping) that header on the switch's egress.

A tc-flower pseudocode of the static configuration done by the driver
would look like this:

$ tc qdisc add dev <cpu-port> clsact
$ for eth in swp0 swp1 swp2 swp3; do \
	tc filter add dev <cpu-port> egress flower indev ${eth} \
		action vlan push id <rxvlan> protocol 802.1ad; \
	tc filter add dev <cpu-port> ingress protocol 802.1Q flower
		vlan_id <txvlan> action vlan pop \
		action mirred egress redirect dev ${eth}; \
done

but of course since DSA does not register network interfaces for the CPU
port, this configuration would be impossible for the user to do. Also,
due to the same reason, it is impossible for the user to inadvertently
delete these rules using tc. These rules do not collide in any way with
tc-flower, they just consume some TCAM space, which is something we can
live with.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:25:27 -08:00
Vladimir Oltean
7c83a7c539 net: dsa: add a second tagger for Ocelot switches based on tag_8021q
There are use cases for which the existing tagger, based on the NPI
(Node Processor Interface) functionality, is insufficient.

Namely:
- Frames injected through the NPI port bypass the frame analyzer, so no
  source address learning is performed, no TSN stream classification,
  etc.
- Flow control is not functional over an NPI port (PAUSE frames are
  encapsulated in the same Extraction Frame Header as all other frames)
- There can be at most one NPI port configured for an Ocelot switch. But
  in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
  port is currently either disabled, or operated as a plain user port
  (albeit an internally-facing one). Having the ability to configure the
  two CPU ports symmetrically could pave the way for e.g. creating a LAG
  between them, to increase bandwidth seamlessly for the system.

So there is a desire to have an alternative to the NPI mode. This change
keeps the default tagger for the Seville and Felix switches as "ocelot",
but it can be changed via the following device attribute:

echo ocelot-8021q > /sys/class/<dsa-master>/dsa/tagging

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:25:27 -08:00
Vladimir Oltean
53da0ebaad net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot

which is a read-only device attribute, introduced in the kernel as
commit 98cdb48071 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").

It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging

This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.

In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.

Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
  the initial tagging protocol driver. Nonetheless, new drivers should
  report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
  protocol, no later than the .setup() method, as well as destroying
  resources used by the last tagger in use, no earlier than the
  .teardown() method.

For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.

Testing:

$ insmod mscc_felix.ko
[   79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[   79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[   97.261724] libphy: VSC9959 internal MDIO bus: probed
[   97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[   97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[   97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[   97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[   97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[   97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[   98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[   98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[   98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[   98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[   98.527948] device eno2 entered promiscuous mode
[   98.544755] DSA: tree 0 setup

$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
 -  10.0.0.1 ping statistics  -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms

$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
        #!/bin/bash

        ip link set swp0 down
        ip link set swp1 down
        ip link set swp2 down
        ip link set swp3 down
        ip link set swp5 down
        ip link set eno2 down
        echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
        ip link set eno2 up
        ip link set swp0 up
        ip link set swp1 up
        ip link set swp2 up
        ip link set swp3 up
        ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[  645.544426] mscc_felix 0000:00:00.5: Link is Down
[  645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:24:39 -08:00
Vladimir Oltean
357f203bb3 net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.

So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.

It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).

This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?

One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.

Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.

There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:24:31 -08:00