Commit Graph

109104 Commits

Author SHA1 Message Date
wenxu
18b6f71748 openvswitch: Make metadata_dst tunnel work in IP_TUNNEL_INFO_BRIDGE mode
There is currently no support for the multicast/broadcast aspects
of VXLAN in ovs. In the datapath flow the tun_dst must be specified.
But in IP_TUNNEL_INFO_BRIDGE mode the tun_dst cannot be specified,
and the packet can be forwarded through the fdb table of the vxlan device.
In this mode broadcast/multicast packets can be sent in the
following way in ovs.

ovs-vsctl add-port br0 vxlan -- set in vxlan type=vxlan \
        options:key=1000 options:remote_ip=flow
ovs-ofctl add-flow br0 in_port=LOCAL,dl_dst=ff:ff:ff:ff:ff:ff, \
        action=output:vxlan

bridge fdb append ff:ff:ff:ff:ff:ff dev vxlan_sys_4789 dst 172.168.0.1 \
        src_vni 1000 vni 1000 self
bridge fdb append ff:ff:ff:ff:ff:ff dev vxlan_sys_4789 dst 172.168.0.2 \
        src_vni 1000 vni 1000 self
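
The effect of the two fdb entries above can be sketched in Python. This is only an illustration of the idea, not the kernel's code: in bridge mode the flow names no tunnel destination, so the vxlan device looks up the destination MAC and VNI in its forwarding database and sends one copy per remote. The names `Fdb`, `append` and `remotes_for` are hypothetical.

```python
from collections import defaultdict


class Fdb:
    """Toy forwarding database keyed by (destination MAC, source VNI)."""

    def __init__(self):
        self._entries = defaultdict(list)

    def append(self, mac, src_vni, dst, vni):
        # Like `bridge fdb append`: add another remote for the same key
        # instead of replacing the existing one.
        self._entries[(mac.lower(), src_vni)].append((dst, vni))

    def remotes_for(self, mac, src_vni):
        # Each remote returned here would get its own encapsulated copy.
        return list(self._entries.get((mac.lower(), src_vni), []))


fdb = Fdb()
fdb.append("ff:ff:ff:ff:ff:ff", 1000, "172.168.0.1", 1000)
fdb.append("ff:ff:ff:ff:ff:ff", 1000, "172.168.0.2", 1000)
print(fdb.remotes_for("ff:ff:ff:ff:ff:ff", 1000))
```

With both entries in place, a broadcast frame on VNI 1000 maps to two remotes, which is why `append` is used rather than a command that would overwrite the first entry.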

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 13:29:15 -07:00
David Ahern
3616d08bcb ipv6: Move ipv6 stubs to a separate header file
The number of stubs is growing and has nothing to do with addrconf.
Move the definition of the stubs to a separate header file and update
users. In the move, drop the vxlan specific comment before ipv6_stub.

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:53:45 -07:00
David Ahern
979e276ebe net: Use common nexthop init and release helpers
With fib_nh_common in place, move common initialization and release
code into helpers used by both ipv4 and ipv6. For the moment, the init
is just the lwt encap and the release is both the netdev reference and
the lwt state reference. More will be added later.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
f1741730dd net: Add fib_nh_common and update fib_nh and fib6_nh
Add fib_nh_common struct with common nexthop attributes. Convert
fib_nh and fib6_nh to use it. Use macros to move existing
fib_nh_* references to the new nh_common.nhc_*.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
ad1601ae02 ipv6: Rename fib6_nh entries
Rename fib6_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, gateway, flags, and lwtstate are common
with all nexthop definitions. In some places new temporary variables
are declared or local variables renamed to maintain line lengths.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
b75ed8b1aa ipv4: Rename fib_nh entries
Rename fib_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, oif, gateway, flags, scope, lwtstate,
nh_weight and nh_upper_bound are common with all nexthop definitions.
In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really
long lines.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
6d3d07b45c ipv6: Refactor fib6_ignore_linkdown
fib6_ignore_linkdown takes a fib6_info but only looks at the net_device
and its IPv6 config. Change it to take a net_device over a fib6_info as
its input argument.

In addition, move it to a header file to make the check inline and usable
later with IPv4 code without going through the ipv6 stub, and rename to
ip6_ignore_linkdown since it is only checking the setting based on the
ipv6 struct on a device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
2b2450ca4a ipv6: Move gateway checks to a fib6_nh setting
The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new
fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to
the new flag. For IPv6 addresses the flag is cheaper than checking that
nh_gw is non-0 like IPv4 does.

While this increases fib6_nh by 8-bytes, the effective allocation size of
a fib6_info is unchanged. The 8 bytes is recovered later with a
fib_nh_common change.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
dac7d0f270 ipv6: Create cleanup helper for fib6_nh
Move the fib6_nh cleanup code to a new helper, fib6_nh_release.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
83c4425159 ipv6: Create init helper for fib6_nh
Similar to IPv4, consolidate the fib6_nh initialization into a helper.
As a new standalone function, add a cleanup path to put lwtstate on
error.

To avoid modifying fib6_config flags, move the reject check to a helper
that is invoked once by fib6_nh_init to reset the device and then
again in ip6_route_info_create to set the fib6_flags.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
faa041a40b ipv4: Create cleanup helper for fib_nh
Move the fib_nh cleanup code from free_fib_info_rcu into a new helper,
fib_nh_release. Move classid accounting into fib_nh_release which is
called per fib_nh to make accounting symmetrical with fib_nh_init.
Export the helper to allow for use with nexthop objects in the
future.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
e4516ef654 ipv4: Create init helper for fib_nh
Consolidate the fib_nh initialization which is duplicated between
fib_create_info for single path and fib_get_nhs for multipath.
Export the helper to allow for use with nexthop objects in the
future.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
331c7a4023 ipv4: Move IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN to helper
in_dev lookup followed by IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN check
is called in several places, some with the rcu lock and others with the
rtnl held.

Move the check to a helper similar to what IPv6 has. Since the helper
can be invoked from either context use rcu_dereference_rtnl to
dereference ip_ptr.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
Yi-Hung Wei
06bd2bdf19 openvswitch: Add timeout support to ct action
Add fine-grain timeout support to the conntrack action.
The new OVS_CT_ATTR_TIMEOUT attribute of the conntrack action
specifies a timeout to be associated with this connection.
If no timeout is specified, it acts as is, that is the default
timeout for the connection will be automatically applied.

Example usage:
$ nfct timeout add timeout_1 inet tcp syn_sent 100 established 200
$ ovs-ofctl add-flow br0 in_port=1,ip,tcp,action=ct(commit,timeout=timeout_1)
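
The fallback described above can be sketched roughly as follows. This is a Python illustration only: the default values are placeholders rather than kernel defaults, `timeout_for` is a hypothetical name, and letting a policy override only the states it lists is an assumption of this sketch.

```python
# Illustrative per-state defaults in seconds; not taken from the kernel.
DEFAULT_TIMEOUTS = {"syn_sent": 120, "established": 432000}

# Named policies, as `nfct timeout add timeout_1 ...` would register them.
POLICIES = {"timeout_1": {"syn_sent": 100, "established": 200}}


def timeout_for(state, policy_name=None):
    """Return the timeout that applies to a connection state.

    With no policy attached the default applies; a policy overrides only
    the states it lists (an assumption of this sketch).
    """
    if policy_name is None:
        return DEFAULT_TIMEOUTS[state]
    policy = POLICIES[policy_name]
    return policy.get(state, DEFAULT_TIMEOUTS[state])


print(timeout_for("established"))
print(timeout_for("established", "timeout_1"))
```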

CC: Pravin Shelar <pshelar@ovn.org>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 16:53:29 -07:00
Yi-Hung Wei
717700d183 netfilter: Export nf_ct_{set,destroy}_timeout()
This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
This is useful for other users (e.g. OVS) that utilize the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 16:53:29 -07:00
Jiri Pirko
14c03ac4c1 net: devlink: remove unused devlink_port_get_phys_port_name() function
Now it is unused, remove it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
Jiri Pirko
af3836df9a net: devlink: introduce devlink_compat_phys_port_name_get()
Introduce devlink_compat_phys_port_name_get() helper that
gets the physical port name for specified netdevice
according to devlink port attributes.
Call this helper from dev_get_phys_port_name()
in case ndo_get_phys_port_name is not defined.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
Jiri Pirko
5dc37bb9b0 net: replace ndo_get_devlink with ndo_get_devlink_port
Follow-up patch is going to need a devlink port instance according to
a netdev. Devlink port instance should be always available when devlink
is used. So change the recently introduced ndo_get_devlink to
ndo_get_devlink_port. With that, adjust the wrapper for the only
user to get devlink pointer.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
David S. Miller
ede1fd1851 Merge tag 'batadv-next-for-davem-20190328' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - Drop license boilerplate (obsoleted by SPDX license IDs),
   by Sven Eckelmann

 - Drop documentation for sysfs and debugfs Documentation,
   by Sven Eckelmann (2 patches)

 - Mark sysfs as optional and deprecated, by Sven Eckelmann (3 patches)

 - Update MAINTAINERS Tree, Chat and Bugtracker,
   by Sven Eckelmann (3 patches)

 - Rename batadv_dat_send_data, by Sven Eckelmann

 - update DAT entries with incoming ARP replies, by Linus Luessing

 - add multicast-to-unicast support for limited destinations,
   by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 09:52:42 -07:00
David S. Miller
356d71e00d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
2019-03-27 17:37:58 -07:00
Eric Dumazet
df453700e8 inet: switch IP ID generator to siphash
According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.

Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.

It is time to switch to siphash and its 128bit keys.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 14:29:26 -07:00
Eric Dumazet
4f661542a4 tcp: fix zerocopy and notsent_lowat issues
My recent patch had at least three problems :

1) TX zerocopy wants notification when skb is acknowledged,
   thus we need to call skb_zcopy_clear() if the skb is
   cached into sk->sk_tx_skb_cache

2) Some applications might expect precise EPOLLOUT
   notifications, so we need to update sk->sk_wmem_queued
   and call sk_mem_uncharge() from sk_wmem_free_skb()
   in all cases. The SOCK_QUEUE_SHRUNK flag must also be set.

3) Reuse of saved skb should have used skb_cloned() instead
  of simply checking if the fast clone has been freed.

Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 13:59:02 -07:00
Numan Siddique
4d5ec89fc8 net: openvswitch: Add a new action check_pkt_len
This patch adds a new action - 'check_pkt_len' which checks the
packet length and executes a set of actions if the packet
length is greater than the specified length or executes
another set of actions if the packet length is less than or equal to it.

This action takes below nlattrs
  * OVS_CHECK_PKT_LEN_ATTR_PKT_LEN - 'pkt_len' to check for

  * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER - Nested actions
    to apply if the packet length is greater than the specified 'pkt_len'

  * OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL - Nested
    actions to apply if the packet length is less than or equal to the
    specified 'pkt_len'.

The main use case for adding this action is to solve the packet
drops because of MTU mismatch in OVN virtual networking solution.
When a VM (which belongs to a logical switch of OVN) sends a packet
destined to go via the gateway router and the nic which provides
external connectivity has a smaller MTU, OVS drops the packet
if the packet length is greater than this MTU.

With the help of this action, OVN will check the packet length
and if it is greater than the MTU size, it will generate an
ICMP packet (type 3, code 4) that includes the next hop mtu
so that the sender can fragment the packets.
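
The branching this action performs can be sketched in a few lines of Python. This models the semantics described above, not the datapath code; `icmp_frag_needed` and `forward` are hypothetical stand-ins for the nested action lists.

```python
def check_pkt_len(packet, pkt_len, actions_if_greater, actions_if_less_equal):
    """Run one of two nested action lists depending on the packet length."""
    if len(packet) > pkt_len:
        actions = actions_if_greater
    else:
        actions = actions_if_less_equal
    return [action(packet) for action in actions]


def icmp_frag_needed(packet):
    # Stand-in for OVN generating ICMP type 3, code 4 toward the sender.
    return ("icmp", 3, 4)


def forward(packet):
    # Stand-in for the normal output action.
    return ("output", len(packet))


print(check_pkt_len(b"x" * 1501, 1500, [icmp_frag_needed], [forward]))
print(check_pkt_len(b"x" * 1500, 1500, [icmp_frag_needed], [forward]))
```

A packet exactly at the threshold takes the "less or equal" branch, which matches the attribute names.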

Reported-at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/047039.html
Suggested-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
CC: Gregory Rose <gvrose8192@gmail.com>
CC: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 13:53:23 -07:00
Heiner Kallweit
3aeb0803f7 ethtool: add PHY Fast Link Down support
This adds support for Fast Link Down as new PHY tunable.
Fast Link Down reduces the time until a link down event is reported
for 1000BaseT. According to the standard it's 750ms, which is too long
for several use cases.

v2:
- add comment describing the constants

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 13:51:49 -07:00
Kristian Evensen
1713cb37bf fou: Support binding FoU socket
An FoU socket is currently bound to the wildcard-address. While this
works fine, there are several use-cases where the use of the
wildcard-address is not desirable. For example, I use FoU on some
multi-homed servers and would like to use FoU on only one of the
interfaces.

This commit adds support for binding FoU sockets to a given source
address/interface, as well as connecting the socket to a given
destination address/port. udp_tunnel already provides the required
infrastructure, so most of the code added is for exposing and setting
the different attributes (local address, peer address, etc.).

The lookups performed when we add, delete or get an FoU-socket have also
been updated to compare all the attributes a user can set. Since the
comparison now involves several elements, I have added a separate
comparison-function instead of open-coding.

In order to test the code and ensure that the new comparison code works
correctly, I started by creating a wildcard socket bound to port 1234 on
my machine. I then tried to create a non-wildcarded socket bound to the
same port, as well as fetching and deleting the socket (including source
address, peer address or interface index in the netlink request).  The
create, fetch and delete requests all failed. Deleting/fetching the
socket was only successful when my netlink request attributes matched
those used to create the socket.

I then repeated the tests, but with a socket bound to a local ip
address, a socket bound to a local address + interface, and a bound
socket that was also «connected» to a peer. Add only worked when no
socket with the matching source address/interface (or wildcard) existed,
while fetch/delete was only successful when all attributes matched.

In addition to testing that the new code works, I also checked that the
current behavior is kept. If none of the new attributes are provided,
then an FoU-socket is configured as before (i.e., wildcarded).  If any
of the new attributes are provided, the FoU-socket is configured as
expected.
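
The matching rules exercised by these tests can be sketched roughly in Python. This is a simplified model, not the fou.c implementation: `FouConfig` and `FouTable` are hypothetical names, and the conflict rule in `add` covers only the cases the tests above describe (an existing wildcard socket, or an existing socket with the same source address/interface).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FouConfig:
    port: int
    local: Optional[str] = None      # None means the wildcard address
    peer: Optional[str] = None
    peer_port: Optional[int] = None
    ifindex: Optional[int] = None


class FouTable:
    def __init__(self):
        self._socks = []

    def add(self, cfg):
        for cur in self._socks:
            if cur.port != cfg.port:
                continue
            wildcard = cur.local is None and cur.ifindex is None
            same_source = (cur.local, cur.ifindex) == (cfg.local, cfg.ifindex)
            if wildcard or same_source:
                raise FileExistsError("conflicting FoU socket")
        self._socks.append(cfg)

    def get(self, cfg):
        # Fetch succeeds only when every attribute matches.
        return next((cur for cur in self._socks if cur == cfg), None)

    def delete(self, cfg):
        # Delete uses the same all-attributes comparison as fetch.
        if cfg not in self._socks:
            raise LookupError("no matching FoU socket")
        self._socks.remove(cfg)


table = FouTable()
table.add(FouConfig(port=1234))
print(table.get(FouConfig(port=1234, local="10.0.0.1")))
```

Using a frozen dataclass gives the all-attributes equality for free, which plays the role of the separate comparison function mentioned above.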

v1->v2:
* Fixed building with IPv6 disabled (kbuild).
* Fixed a return type warning and make the ugly comparison function more
readable (kbuild).
* Describe more in detail what has been tested (thanks David Miller).
* Make peer port required if peer address is specified.

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 13:30:07 -07:00
Linus Torvalds
1a9df9e29c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Fixes here and there, a couple new device IDs, as usual:

   1) Fix BQL race in dpaa2-eth driver, from Ioana Ciornei.

   2) Fix 64-bit division in iwlwifi, from Arnd Bergmann.

   3) Fix documentation for some eBPF helpers, from Quentin Monnet.

   4) Some UAPI bpf header sync with tools, also from Quentin Monnet.

   5) Set descriptor ownership bit at the right time for jumbo frames in
      stmmac driver, from Aaro Koskinen.

   6) Set IFF_UP properly in tun driver, from Eric Dumazet.

   7) Fix load/store doubleword instruction generation in powerpc eBPF
      JIT, from Naveen N. Rao.

   8) nla_nest_start() return value checks all over, from Kangjie Lu.

   9) Fix asoc_id handling in SCTP after the SCTP_*_ASSOC changes this
      merge window. From Marcelo Ricardo Leitner and Xin Long.

  10) Fix memory corruption with large MTUs in stmmac, from Aaro
      Koskinen.

  11) Do not use ipv4 header for ipv6 flows in TCP and DCCP, from Eric
      Dumazet.

  12) Fix topology subscription cancellation in tipc, from Erik Hugne.

  13) Memory leak in genetlink error path, from Yue Haibing.

  14) Validate control actions properly in packet scheduler, from Davide
      Caratti.

  15) Even if we get EEXIST, we still need to rehash if a shrink was
      delayed. From Herbert Xu.

  16) Fix interrupt mask handling in interrupt handler of r8169, from
      Heiner Kallweit.

  17) Fix leak in ehea driver, from Wen Yang"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (168 commits)
  dpaa2-eth: fix race condition with bql frame accounting
  chelsio: use BUG() instead of BUG_ON(1)
  net: devlink: skip info_get op call if it is not defined in dumpit
  net: phy: bcm54xx: Encode link speed and activity into LEDs
  tipc: change to check tipc_own_id to return in tipc_net_stop
  net: usb: aqc111: Extend HWID table by QNAP device
  net: sched: Kconfig: update reference link for PIE
  net: dsa: qca8k: extend slave-bus implementations
  net: dsa: qca8k: remove leftover phy accessors
  dt-bindings: net: dsa: qca8k: support internal mdio-bus
  dt-bindings: net: dsa: qca8k: fix example
  net: phy: don't clear BMCR in genphy_soft_reset
  bpf, libbpf: clarify bump in libbpf version info
  bpf, libbpf: fix version info and add it to shared object
  rxrpc: avoid clang -Wuninitialized warning
  tipc: tipc clang warning
  net: sched: fix cleanup NULL pointer exception in act_mirr
  r8169: fix cable re-plugging issue
  net: ethernet: ti: fix possible object reference leak
  net: ibm: fix possible object reference leak
  ...
2019-03-27 12:22:57 -07:00
David S. Miller
5133a4a800 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-03-26

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) introduce bpf_tcp_check_syncookie() helper for XDP and tc, from Lorenz.

2) allow bpf_skb_ecn_set_ce() in tc, from Peter.

3) numerous bpf tc tunneling improvements, from Willem.

4) and other miscellaneous improvements from Adrian, Alan, Daniel, Ivan, Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 21:44:13 -07:00
Vladimir Oltean
450895d04b net: phy: bcm54xx: Encode link speed and activity into LEDs
Previously the green and amber LEDs on this quad PHY were solid, to
indicate an encoding of the link speed (10/100/1000).

This keeps the LEDs always on just as before, but now they flash on
Rx/Tx activity.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 11:24:47 -07:00
Linus Torvalds
a3ac7917b7 Revert "parport: daisy: use new parport device model"
This reverts commit 1aec421120.

Steven Rostedt reports that it causes a hang at bootup and bisected it
to this commit.

The trigger is apparently a module alias for "parport_lowlevel" that
points to "parport_pc", which causes a hang with

    modprobe -q -- parport_lowlevel

blocking forever with a backtrace like this:

    wait_for_completion_killable+0x1c/0x28
    call_usermodehelper_exec+0xa7/0x108
    __request_module+0x351/0x3d8
    get_lowlevel_driver+0x28/0x41 [parport]
    __parport_register_driver+0x39/0x1f4 [parport]
    daisy_drv_init+0x31/0x4f [parport]
    parport_bus_init+0x5d/0x7b [parport]
    parport_default_proc_register+0x26/0x1000 [parport]
    do_one_initcall+0xc2/0x1e0
    do_init_module+0x50/0x1d4
    load_module+0x1c2e/0x21b3
    sys_init_module+0xef/0x117

Sudip says:
 "Due to the new device model daisy driver will now try to find the
  parallel ports while trying to register its driver so that it can bind
  with them. Now, since daisy driver is loaded while parport bus is
  initialising the list of parport is still empty and it tries to load
  the lowlevel driver, which has an alias set to parport_pc, now causes
  a deadlock"

But I don't think the daisy driver should be loaded by the parport
initialization in the first place, so let's revert the whole change.

If the daisy driver can just initialize separately on its own (like a
driver should), instead of hooking into the parport init sequence
directly, this issue probably would go away.

Reported-and-bisected-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-25 14:49:00 -07:00
Linus Lüssing
32e727449c batman-adv: Add multicast-to-unicast support for multiple targets
With this patch multicast packets with a limited number of destinations
(current default: 16) will be split and transmitted by the originator as
individual unicast transmissions.

Wifi broadcasts with their low bitrate are still a costly undertaking.
In a mesh network this cost multiplies with the overall size of the mesh
network. Therefore using multiple unicast transmissions instead of
broadcast flooding is almost always less burdensome for the mesh
network.

The maximum amount of unicast packets can be configured via the newly
introduced multicast_fanout parameter. If this limit is exceeded
distribution will fall back to classic broadcast flooding.

The multicast-to-unicast conversion is performed on the initial
multicast sender node and counts on a final destination node, mesh-wide
basis (and not next hop, neighbor node basis).
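
The decision described above can be sketched in Python. This is a model of the stated rule only, not the batman-adv code; `plan_multicast` is a hypothetical name, and the default of 16 comes from the text.

```python
def plan_multicast(destinations, multicast_fanout=16):
    """Decide how the originator sends one multicast packet.

    `destinations` are the final destination nodes across the whole mesh,
    not next-hop neighbours.
    """
    if len(destinations) <= multicast_fanout:
        # Few enough receivers: one unicast transmission per destination.
        return [("unicast", node) for node in destinations]
    # Limit exceeded: fall back to classic broadcast flooding.
    return [("broadcast", None)]


print(plan_multicast(["node-a", "node-b", "node-c"]))
print(plan_multicast(["node-%d" % i for i in range(17)]))
```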

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-03-25 10:01:13 +01:00
Sven Eckelmann
0d5f20c42b batman-adv: Drop license boilerplate
All files got a SPDX-License-Identifier with commit 7db7d9f369
("batman-adv: Add SPDX license identifier above copyright header"). All the
required information about the license conditions can be found in
LICENSES/.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-03-25 09:31:35 +01:00
Jiri Pirko
f6b19b354d net: devlink: select NET_DEVLINK from drivers
Some drivers are becoming more dependent on NET_DEVLINK being selected
in configuration. With upcoming compat functions, the behavior would be
wrong in case devlink was not compiled in. So make the drivers select
NET_DEVLINK and rely on the functions being there, not just stubs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
b8f975545c net: devlink: add port type spinlock
Add spinlock to protect port type and type_dev pointer consistency.
Without that, userspace may see inconsistent type and type_dev
combinations.
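
The consistency problem can be illustrated with a small Python sketch. It uses a `threading.Lock` in place of a kernel spinlock, and `Port`, `set_type` and `snapshot` are hypothetical names, so treat it as an analogy rather than the devlink code.

```python
import threading


class Port:
    def __init__(self):
        self._lock = threading.Lock()
        self._type = "notset"
        self._type_dev = None

    def set_type(self, port_type, type_dev):
        # Update both fields under the lock so they change together.
        with self._lock:
            self._type = port_type
            self._type_dev = type_dev

    def snapshot(self):
        # Readers take the same lock, so they never observe a new type
        # paired with an old type_dev (or the reverse).
        with self._lock:
            return self._type, self._type_dev


port = Port()
port.set_type("eth", "eth0")
print(port.snapshot())
```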

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
v1->v2:
- rebased
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Linus Torvalds
19caf581ba Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A set of x86 fixes:

   - Prevent potential NULL pointer dereferences in the HPET and HyperV
     code

   - Exclude the GART aperture from /proc/kcore to prevent kernel
     crashes on access

   - Use the correct macros for Cyrix I/O on Geode processors

   - Remove yet another kernel address printk leak

   - Announce microcode reload completion as requested by quite some
     people. Microcode loading has become popular recently.

   - Some 'Make Clang' happy fixlets

   - A few cleanups for recently added code"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/gart: Exclude GART aperture from kcore
  x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
  x86/mm/pti: Make local symbols static
  x86/cpu/cyrix: Remove {get,set}Cx86_old macros used for Cyrix processors
  x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
  x86/microcode: Announce reload operation's completion
  x86/hyperv: Prevent potential NULL pointer dereference
  x86/hpet: Prevent potential NULL pointer dereference
  x86/lib: Fix indentation issue, remove extra tab
  x86/boot: Restrict header scope to make Clang happy
  x86/mm: Don't leak kernel addresses
  x86/cpufeature: Fix various quality problems in the <asm/cpu_device_hd.h> header
2019-03-24 11:12:27 -07:00
Linus Torvalds
e08fef881d Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of fixes for the interrupt subsystem:

   - Remove secondary GIC support on systems w/o device-tree support

   - A set of small fixlets in various irqchip drivers

   - static and fall-through annotations

   - Kernel doc and typo fixes"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Mark expected switch case fall-through
  genirq/devres: Remove excess parameter from kernel doc
  irqchip/irq-mvebu-sei: Make mvebu_sei_ap806_caps static
  irqchip/mbigen: Don't clear eventid when freeing an MSI
  irqchip/stm32: Don't set rising configuration registers at init
  irqchip/stm32: Don't clear rising/falling config registers at init
  dt-bindings: irqchip: renesas-irqc: Document r8a774c0 support
  irqchip/mmp: Make mmp_irq_domain_ops static
  irqchip/brcmstb-l2: Make two init functions static
  genirq: Fix typo in comment of IRQD_MOVE_PCNTXT
  irqchip/gic-v3-its: Fix comparison logic in lpi_range_cmp
  irqchip/gic: Drop support for secondary GIC in non-DT systems
  irqchip/imx-irqsteer: Fix of_property_read_u32() error handling
2019-03-24 10:51:23 -07:00
Linus Torvalds
e0046bb302 Merge tag 'auxdisplay-for-linus-v5.1-rc2' of git://github.com/ojeda/linux
Pull auxdisplay updates from Miguel Ojeda:
 "A few fixes and improvements for auxdisplay:

   - Series to fix a memory leak in hd44780 while introducing
     charlcd_free(). From Andy Shevchenko

   - Series to clean up the Kconfig menus and a couple of improvements
     for charlcd. From Mans Rullgard"

* tag 'auxdisplay-for-linus-v5.1-rc2' of git://github.com/ojeda/linux:
  auxdisplay: charlcd: make backlight initial state configurable
  auxdisplay: charlcd: simplify init message display
  auxdisplay: deconfuse configuration
  auxdisplay: hd44780: Convert to use charlcd_free()
  auxdisplay: panel: Convert to use charlcd_free()
  auxdisplay: charlcd: Introduce charlcd_free() helper
  auxdisplay: charlcd: Move to_priv() to charlcd namespace
  auxdisplay: hd44780: Fix memory leak on ->remove()
2019-03-24 09:51:55 -07:00
David S. Miller
d64fee0a03 Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:

====================
mlx5-updates-2019-03-20

This series includes updates to mlx5 driver,

1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and
   Replaces TC VLAN pop and push actions with VLAN modify

Note: This series includes two simple non-mlx5 patches,

1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:03:44 -04:00
Eric Dumazet
8b27dae5a2 tcp: add one skb cache for rx
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.

This means the incoming skb had to be allocated on a cpu,
but freed on another.

This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.

A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.

Moreover, performing the __kfree_skb() in the recvmsg() context
adds latency for user applications, and increases the probability
of trapping them in backlog processing, since the BH handler
might find the socket owned by the user.

This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.

(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
 instead of 8 Mpps)

This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :

 - CPU handling the NIC rx interrupts, feeding the receive queue,
  and (after this patch) freeing the skbs that were consumed.

 - CPU in recvmsg() system call, essentially 100 % busy copying out
  data to user space.

Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.

Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.

To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
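
The rx-side idea in 8b27dae5a2 above can be modeled in user space as a one-slot deferred-free cache. This is only a sketch under stated assumptions: `struct buf`, `struct sock_model`, `reader_consume()` and `bh_reclaim()` are hypothetical names invented for illustration, not kernel symbols, and the socket lock is modeled by running everything on one thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an skb and a socket; not kernel types. */
struct buf { size_t len; };

struct sock_model {
    struct buf *rx_cache;   /* at most one consumed buffer awaiting free */
    int frees_by_reader;    /* frees paid for in the recvmsg()-like path */
    int frees_by_bh;        /* frees paid for in the BH-like rx path */
};

/* Reader side: called once a buffer has been fully copied out.
 * Assumed to run with the (modeled) socket lock held. */
static void reader_consume(struct sock_model *sk, struct buf *b)
{
    assert(b != NULL);
    if (sk->rx_cache == NULL) {
        sk->rx_cache = b;       /* park it; the rx path frees it later */
        return;
    }
    free(b);                    /* slot already taken: free right here */
    sk->frees_by_reader++;
}

/* Rx side: called when the BH-like path next touches the socket. */
static void bh_reclaim(struct sock_model *sk)
{
    if (sk->rx_cache != NULL) {
        free(sk->rx_cache);     /* freed by the path that did the allocation */
        sk->rx_cache = NULL;
        sk->frees_by_bh++;
    }
}
```

Because the slot holds at most one buffer, the worst-case extra memory per socket is bounded, which matches the memory-exhaustion argument in the commit message.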
Eric Dumazet
472c2e07ee tcp: add one skb cache for tx
On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.

    20.69%  [kernel]       [k] queued_spin_lock_slowpath
     5.64%  [kernel]       [k] _raw_spin_lock
     3.83%  [kernel]       [k] syscall_return_via_sysret
     3.48%  [kernel]       [k] __entry_text_start
     1.76%  [kernel]       [k] __netif_receive_skb_core
     1.64%  [kernel]       [k] __fget

For each sendmsg(), we allocate one skb, and free it when the ACK packet arrives.

In many cases, ACK packets are handled by another cpu, and this unfortunately
incurs heavy costs for the slab layer.

This patch uses an extra pointer in socket structure, so that we try to reuse
the same skb and avoid these expensive costs.

We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
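
The tx-side reuse described in 472c2e07ee above can be sketched as an allocator with a one-entry per-socket cache. The names `struct txbuf`, `struct tx_sock`, `tx_alloc()` and `tx_free()` are assumptions made for this illustration, and `calloc()`/`free()` stand in for the slab allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; not kernel types. */
struct txbuf { char data[64]; };

struct tx_sock {
    struct txbuf *tx_cache; /* the "extra pointer" holding one spare buffer */
    int slab_allocs;        /* times we had to go to the allocator */
    int slab_frees;         /* times we had to give memory back */
};

/* sendmsg()-like path: prefer the cached buffer over a fresh allocation. */
static struct txbuf *tx_alloc(struct tx_sock *sk)
{
    struct txbuf *b = sk->tx_cache;

    assert(sk != NULL);
    if (b != NULL) {
        sk->tx_cache = NULL;
        memset(b, 0, sizeof(*b));   /* hand out a clean buffer */
        return b;
    }
    sk->slab_allocs++;
    return calloc(1, sizeof(*b));
}

/* ACK-like path: keep one buffer for reuse, free any beyond that. */
static void tx_free(struct tx_sock *sk, struct txbuf *b)
{
    if (b == NULL)
        return;
    if (sk->tx_cache == NULL) {
        sk->tx_cache = b;
        return;
    }
    sk->slab_frees++;
    free(b);
}
```

In a request/response pattern with one buffer in flight per socket, every allocation after the first is served from the cache in this model, so the allocator is not touched again.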
Eric Dumazet
dc05360fee net: convert rps_needed and rfs_needed to new static branch api
We prefer static_branch_unlikely() over static_key_false() these days.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
Paolo Abeni
28cff537ef net: sched: add empty status flag for NOLOCK qdisc
The queue is marked not empty after acquiring the seqlock,
and it is up to the NOLOCK qdisc to reset that status on dequeue.
Since the empty status lies on the same cache line as the
seqlock, it is always hot in cache during the updates.

This makes the empty flag update a little bit loose. Given
the lack of synchronization between enqueue and dequeue, this
is unavoidable.

v2 -> v3:
 - qdisc_is_empty() has a const argument (Eric)

v1 -> v2:
 - really use an 'empty' flag instead of 'not_empty', as
   suggested by Eric

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:52:36 -04:00
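
The "loose" empty flag from 28cff537ef above can be illustrated with a small single-threaded model. Everything here is an assumption made for the sketch: `struct nolock_q`, the `locked` field standing in for the seqlock, and the `q_*` helpers are hypothetical, and a real lockless qdisc would need proper memory-ordering primitives that this model omits.

```c
#include <assert.h>
#include <stdbool.h>

#define QLEN 8u             /* power of two so index wraparound stays valid */

/* Hypothetical model of a lockless queue with an 'empty' hint. */
struct nolock_q {
    int items[QLEN];
    unsigned head, tail;
    bool locked;            /* stands in for the seqlock */
    bool empty;             /* hint only; may lag the real queue state */
};

static void q_init(struct nolock_q *q)
{
    q->head = q->tail = 0;
    q->locked = false;
    q->empty = true;
}

/* Taking the lock marks the queue "not empty" before any work is done. */
static bool q_run_begin(struct nolock_q *q)
{
    if (q->locked)
        return false;
    q->locked = true;
    q->empty = false;
    return true;
}

static void q_run_end(struct nolock_q *q)
{
    assert(q->locked);
    q->locked = false;
}

/* Enqueue does not touch the hint in this model. */
static bool q_enqueue(struct nolock_q *q, int v)
{
    if (q->tail - q->head == QLEN)
        return false;       /* full */
    q->items[q->tail++ % QLEN] = v;
    return true;
}

/* Only a dequeue that finds nothing sets the hint back to "empty". */
static bool q_dequeue(struct nolock_q *q, int *out)
{
    if (q->head == q->tail) {
        q->empty = true;
        return false;
    }
    *out = q->items[q->head++ % QLEN];
    return true;
}
```

Between `q_run_begin()` and the first failed `q_dequeue()`, the hint reads "not empty" even when nothing is queued, which is the kind of imprecision the commit message accepts.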
Soheil Hassas Yeganeh
576fd2f7ca tcp: add documentation for tcp_ca_state
Add documentation to the tcp_ca_state enum, since this enum is
exposed in uapi.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Sowmini Varadhan <sowmini05@gmail.com>
Acked-by: Sowmini Varadhan <sowmini05@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:50:05 -04:00
Linus Torvalds
1bdd3dbfff Merge tag 'io_uring-20190323' of git://git.kernel.dk/linux-block
Pull io_uring fixes and improvements from Jens Axboe:
 "The first five in this series are heavily inspired by the work Al did
  on the aio side to fix the races there.

  The last two re-introduce a feature that was in io_uring before it got
  merged, but which I pulled since we didn't have a good way to have
  BVEC iters that already have a stable reference. These aren't
  necessarily related to block, it's just how io_uring pins fixed
  buffers"

* tag 'io_uring-20190323' of git://git.kernel.dk/linux-block:
  block: add BIO_NO_PAGE_REF flag
  iov_iter: add ITER_BVEC_FLAG_NO_REF flag
  io_uring: mark me as the maintainer
  io_uring: retry bulk slab allocs as single allocs
  io_uring: fix poll races
  io_uring: fix fget/fput handling
  io_uring: add prepped flag
  io_uring: make io_read/write return an integer
  io_uring: use regular request ref counts
2019-03-23 10:25:12 -07:00
Linus Torvalds
2335cbe648 Merge tag 'for-linus-20190323' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes/changes that should go into this series. This contains:

   - Kernel doc / comment updates (Bart, Shenghui)

   - Un-export of core-only used function (Bart)

   - Fix race on loop file access (Dongli)

   - pf/pcd queue cleanup fixes (me)

   - Use appropriate helper for RESTART bit set (Yufen)

   - Use named identifier for classic poll (Yufen)"

* tag 'for-linus-20190323' of git://git.kernel.dk/linux-block:
  sbitmap: trivial - update comment for sbitmap_deferred_clear_bit
  blkcg: Fix kernel-doc warnings
  blk-iolatency: #include "blk.h"
  block: Unexport blk_mq_add_to_requeue_list()
  block: add BLK_MQ_POLL_CLASSIC for hybrid poll and return EINVAL for unexpected value
  blk-mq: remove unused 'nr_expired' from blk_mq_hw_ctx
  loop: access lo_backing_file only when the loop device is Lo_bound
  blk-mq: use blk_mq_sched_mark_restart_hctx to set RESTART
  paride/pcd: cleanup queues when detection fails
  paride/pf: cleanup queues when detection fails
2019-03-23 10:14:42 -07:00
Linus Torvalds
9a1050ad83 Merge tag 'ceph-for-5.1-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "A follow up for the new alloc_size logic and a blacklisting fix,
  marked for stable"

* tag 'ceph-for-5.1-rc2' of git://github.com/ceph/ceph-client:
  rbd: drop wait_for_latest_osdmap()
  libceph: wait for latest osdmap in ceph_monc_blacklist_add()
  rbd: set io_min, io_opt and discard_granularity to alloc_size
2019-03-23 10:04:47 -07:00
Kairui Song
ffc8599aa9 x86/gart: Exclude GART aperture from kcore
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range. Accessing the GART range via
/proc/kcore results in a kernel crash.

vmcore used to have the same issue, until it was fixed with commit
2a3e83c6f9 ("x86/gart: Exclude GART aperture from vmcore"), leveraging
existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
when attempting to read the aperture region, and so it won't read from the
actual memory.

Apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in the
previous vmcore fix, with some minor adjustments: rename some functions
for more general usage, and simplify the hook infrastructure a bit as there
is no module usage yet.

Suggested-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Dave Young <dyoung@redhat.com>
Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
2019-03-23 12:11:49 +01:00
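
The hook approach described in ffc8599aa9 above can be sketched as a read path that consults a filter before touching memory and returns zeroes for excluded pages. This is a user-space model under stated assumptions: `pfn_filter_fn`, `excl_start`/`excl_end`, `aperture_pfn_ok()`, `read_page()` and the tiny `PAGE_SZ` are all hypothetical and chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 16u         /* deliberately tiny page size for the demo */

/* Hypothetical hook type: nonzero means "this page frame may be read". */
typedef int (*pfn_filter_fn)(unsigned long pfn);

/* Excluded page-frame range [excl_start, excl_end), set by the caller. */
static unsigned long excl_start, excl_end;

static int aperture_pfn_ok(unsigned long pfn)
{
    return !(pfn >= excl_start && pfn < excl_end);
}

/* Copy one page out of 'mem', or zero-fill if the hook rejects the pfn. */
static void read_page(const uint8_t *mem, unsigned long pfn,
                      pfn_filter_fn hook, uint8_t *out)
{
    assert(mem != NULL && out != NULL);
    if (hook != NULL && !hook(pfn)) {
        memset(out, 0, PAGE_SZ);    /* never touch the excluded range */
        return;
    }
    memcpy(out, mem + pfn * PAGE_SZ, PAGE_SZ);
}
```

Passing `NULL` as the hook reads every page unconditionally, which corresponds to the behavior before any filter is registered in this model.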
Willem de Bruijn
868d523535 bpf: add bpf_skb_adjust_room encap flags
When pushing tunnel headers, annotate skbs in the same way as tunnel
devices.

For GSO packets, the network stack requires certain fields to be set
to segment packets with tunnel headers. gre_gso_segment depends on the
transport and inner mac headers, for instance.

Add an option to pass this information.

Remove the restriction on len_diff to network header length, which
is too short, e.g., for GRE protocols.

Changes
  v1->v2:
  - document new flags
  - BPF_F_ADJ_ROOM_MASK moved
  v2->v3:
  - BPF_F_ADJ_ROOM_ENCAP_L3_MASK moved

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-22 13:52:45 -07:00
Willem de Bruijn
2278f6cc15 bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_FIXED_GSO
bpf_skb_adjust_room adjusts gso_size of gso packets to account for the
pushed or popped header room.

This is not allowed with UDP, where gso_size delineates datagrams. Add
an option to avoid these updates and allow this call for datagrams.

It can also be used with TCP, when MSS is known to allow headroom,
e.g., through MSS clamping or route MTU.

Changes v1->v2:
  - document flag BPF_F_ADJ_ROOM_FIXED_GSO
  - do not expose BPF_F_ADJ_ROOM_MASK through uapi, as it may change.

Link: https://patchwork.ozlabs.org/patch/1052497/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-22 13:52:45 -07:00
Willem de Bruijn
14aa31929b bpf: add bpf_skb_adjust_room mode BPF_ADJ_ROOM_MAC
bpf_skb_adjust_room allows inserting room in an skb.

Existing mode BPF_ADJ_ROOM_NET inserts room after the network header
by pulling the skb, moving the network header forward and zeroing the
new space.

Add new mode BPF_ADJ_ROOM_MAC that inserts room after the mac
header. This allows inserting tunnel headers in front of the network
header without having to recreate the network header in the original
space, avoiding two copies.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-22 13:52:45 -07:00
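
The difference between the two modes in 14aa31929b above is where the zeroed room lands relative to the headers. The sketch below models only the resulting byte layout on a flat buffer; `enum room_mode`, `insert_room()` and its parameters are hypothetical, and the kernel's actual data movement (pulling the skb and moving headers forward) is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local names for the two modes discussed in the commit message. */
enum room_mode { ROOM_NET, ROOM_MAC };

/*
 * Open 'room' zeroed bytes in 'pkt' (current length 'len', capacity 'cap').
 * ROOM_MAC places the room right after the mac header, ROOM_NET after the
 * network header. Returns the new length, or 0 if the request does not fit.
 */
static size_t insert_room(unsigned char *pkt, size_t len, size_t cap,
                          size_t mac_len, size_t net_len,
                          enum room_mode mode, size_t room)
{
    size_t off = (mode == ROOM_MAC) ? mac_len : mac_len + net_len;

    assert(pkt != NULL);
    if (len > cap || off > len || room > cap - len)
        return 0;
    memmove(pkt + off + room, pkt + off, len - off);  /* shift the tail */
    memset(pkt + off, 0, room);                       /* zero the new space */
    return len + room;
}
```

With ROOM_MAC the original network header stays contiguous behind the new space, so a tunnel header can be written in front of it without rebuilding it.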
Eli Britstein
0eb69bb996 net/mlx5e: Add VLAN ID rewrite fields
Add VLAN ID rewrite fields as a pre-step to support this rewrite.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-22 12:09:32 -07:00