Commit Graph

99157 Commits

Author SHA1 Message Date
Yixian Liu
7b48221cf4 RDMA/hns: Fix cqn type and init resp
This patch changes the type of cqn from u32 to u64 to keep
userspace and kernel consistent, initializes resp both for
cq and qp to zeros, and also changes the condition judgment
of outlen considering future caps extension.

Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes: e088a685ea (hns: Support rq record doorbell for the user space)
Fixes: 9b44703d0a (hns: Support cq record doorbell for the user space)
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 15:34:26 -06:00
Parav Pandit
e41a7c4194 IB/core: Move rdma_addr_find_l2_eth_by_grh to core_priv.h
Before commit [1], rdma_addr_find_l2_eth_by_grh() was an exported function
and therefore declaration in include/rdma/ib_addr.h was fine.

But now that its scope is limited to ib_core module, its better to have it
in core_priv.h.

[1] commit 1060f86534 ("IB/{core/cm}: Fix generating a return AH for
RoCEE")

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 15:33:39 -06:00
Parav Pandit
0a51415935 IB/core: Refactor ib_init_ah_attr_from_path() for RoCE
Resolving route for RoCE for a path record is needed only for the
received CM requests.
Therefore,
(a) ib_init_ah_attr_from_path() is refactored first to isolate the
code of resolving route.
(b) Setting dlid, path bits is not needed for RoCE.

Additionally ah attribute initialization is done from the path record
entry, so it is better to refer to path record entry type for
different link layer instead of ah attribute type while initializing
ah attribute itself.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 15:17:49 -06:00
Parav Pandit
a9c06aeba9 IB/core: Remove rdma_resolve_ip_route() as exported symbol
rdma_resolve_ip_route() is used only by ib_core module. Therefore it is
removed as an exported symbol.

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 15:17:45 -06:00
Parav Pandit
b26c4a1138 IB/{core, ipoib}: Simplify ib_find_gid() for unused ndev
ib_find_gid() is only used by IPoIB driver. For IB link layer, GID table
entries are not based on netdevice. Netdevice parameter is unused here.
Therefore, it is removed.

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 14:40:37 -06:00
Parav Pandit
6612b4983f IB/core: Fix comments of GID query functions
Exported symbol's comments should be with function definition and not in
the header file. Therefore comments of ib_find_cached_gid() and
ib_find_cached_gid_by_port() functions are moved closer to their
definitions.

The function name in then comment is different than the actual function
name, fix it to be same as ib_cache_gid_find_by_filter().

Also current comment section of ib_find_cached_gid_by_port() contains the
desciption of ib_find_cached_gid(), fix that as well.

Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 14:40:36 -06:00
Doug Ledford
2d873449a2 Merge branch 'k.o/wip/dl-for-rc' into k.o/wip/dl-for-next
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch.  This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.

Conflicts:
	drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
	(IB/mlx5: Fix cleanup order on unload) added to for-rc and
	commit b5ca15ad7e (IB/mlx5: Add proper representors support)
	add as part of the devel cycle both needed to modify the
	init/de-init functions used by mlx5.  To support the new
	representors, the new functions added by the cleanup patch
	needed to be made non-static, and the init/de-init list
	added by the representors patch needed to be modified to
	match the init/de-init list changes made by the cleanup
	patch.
Updates:
	drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
	prototypes added by representors patch to reflect new function
	names as changed by cleanup patch
	drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
	stage list to match new order from cleanup patch

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-14 19:28:58 -04:00
Mark Bloch
72f7cc09b1 IB/mlx5: Expose more priorities for bypass namespace
BYPASS namespace is used by the RDMA side to insert flow rules into
the vport RX flow tables. Currently only 8 priorities are exposed,
increase this to 16 to allow more flexibility. This change will also
cause the BYPASS namespace to use 32 levels (as apposed to 16 today) of
flow tables, 16 levels for regular rules and 16 for don't trap rules.

Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-14 16:05:16 -04:00
Leon Romanovsky
19b1f54099 RDMA/verbs: Simplify modify QP check
All callers to ib_modify_qp_is_ok() provides enum ib_qp_state
makes the checks of out-of-scope redundant. Let's remove them
together with updating function signature to return boolean result.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-14 15:34:25 -04:00
Yixian Liu
9b44703d0a RDMA/hns: Support cq record doorbell for the user space
This patch updates to support cq record doorbell for
the user space.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-13 16:40:15 -04:00
Yixian Liu
e088a685ea RDMA/hns: Support rq record doorbell for the user space
This patch adds interfaces and definitions to support the rq record
doorbell for the user space.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-13 16:40:15 -04:00
Boris Pismenny
c2b37f7648 IB/mlx5: Fix integer overflows in mlx5_ib_create_srq
This patch validates user provided input to prevent integer overflow due
to integer manipulation in the mlx5_ib_create_srq function.

Cc: syzkaller <syzkaller@googlegroups.com>
Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-13 16:31:21 -04:00
Doug Ledford
2ea32cd6c7 Merge tag 'mlx5-updates-2018-02-28-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-28-2 (IPSec-2)

This series follows our previous one to lay out the foundations for IPSec
in user-space and extend current kernel netdev IPSec support. As noted in
our previous pull request cover letter "mlx5-updates-2018-02-28-1 (IPSec-1)",
the IPSec mechanism will be supported through our flow steering mechanism.
Therefore, we need to change the initialization order. Furthermore, IPsec
is also supported in both egress and ingress. Since our current flow
steering is egress only, we add an empty (only implemented through FPGA
steering ops) egress namespace to handle that case. We also implement
the required flow steering callbacks and logic in our FPGA driver.

We extend the FPGA support for ESN and modifying a xfrm too. Therefore, we
add support for some new FPGA command interface that supports them. The
other required bits are added too. The new features and requirements are
advertised via cap bits.

Last but not least, we revise our driver's accel_esp API. This API will be
shared between our netdev and IB driver, so we need to have all the required
functionality from both worlds.

Regards,
Aviad and Matan

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-13 15:49:34 -04:00
Steve Wise
29cf1351d4 RDMA/nldev: provide detailed PD information
Implement the RDMA nldev netlink interface for dumping detailed PD
information.

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-08 15:03:03 -05:00
Steve Wise
fccec5b89a RDMA/nldev: provide detailed MR information
Implement the RDMA nldev netlink interface for dumping detailed
MR information.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-08 15:03:03 -05:00
Steve Wise
a34fc0893e RDMA/nldev: provide detailed CQ information
Implement the RDMA nldev netlink interface for dumping detailed
CQ information.

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-08 15:03:03 -05:00
Steve Wise
00313983cd RDMA/nldev: provide detailed CM_ID information
Implement RDMA nldev netlink interface to get detailed CM_ID information.

Because cm_id's are attached to rdma devices in various work queue
contexts, the pid and task information at restrak_add() time is sometimes
not useful.  For example, an nvme/f host connection cm_id ends up being
bound to a device in a work queue context and the resulting pid at attach
time no longer exists after connection setup.  So instead we mark all
cm_id's created via the rdma_ucm as "user", and all others as "kernel".
This required tweaking the restrack code a little.  It also required
wrapping some rdma_cm functions to allow passing the module name string.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-08 15:03:03 -05:00
Aviad Yehezkel
cb01008390 net/mlx5: IPSec, Add support for ESN
Currently ESN is not supported with IPSec device offload.

This patch adds ESN support to IPsec device offload.
Implementing new xfrm device operation to synchronize offloading device
ESN with xfrm received SN. New QP command to update SA state at the
following:

           ESN 1                    ESN 2                  ESN 3
|-----------*-----------|-----------*-----------|-----------*
^           ^           ^           ^           ^           ^

^ - marks where QP command invoked to update the SA ESN state
    machine.
| - marks the start of the ESN scope (0-2^32-1). At this point move SA
    ESN overlap bit to zero and increment ESN.
* - marks the middle of the ESN scope (2^31). At this point move SA
    ESN overlap bit to one.

Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Yossef Efraim <yossefe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:54:36 -08:00
Aviad Yehezkel
05564d0ae0 net/mlx5: Add flow-steering commands for FPGA IPSec implementation
In order to add a context to the FPGA, we need to get both the software
transform context (which includes the keys, etc) and the
source/destination IPs (which are included in the steering
rule). Therefore, we register new set of firmware like commands for
the FPGA. Each time a rule is added, the steering core infrastructure
calls the FPGA command layer. If the rule is intended for the FPGA,
it combines the IPs information with the software transformation
context and creates the respective hardware transform.
Afterwards, it calls the standard steering command layer.

Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:54:35 -08:00
Aviad Yehezkel
d6c4f0298c net/mlx5: Refactor accel IPSec code
The current code has one layer that executed FPGA commands and
the Ethernet part directly used this code. Since downstream patches
introduces support for IPSec in mlx5_ib, we need to provide some
abstractions. This patch refactors the accel code into one layer
that creates a software IPSec transformation and another one which
creates the actual hardware context.
The internal command implementation is now hidden in the FPGA
core layer. The code also adds the ability to share FPGA hardware
contexts. If two contexts are the same, only a reference count
is taken.

Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:54:34 -08:00
Aviad Yehezkel
af9fe19d66 net/mlx5: Added required metadata capability for ipsec
Currently our device requires additional metadata in packet
to perform ipsec crypto offload.

Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:54:34 -08:00
Aviad Yehezkel
1d2005e204 net/mlx5: Export ipsec capabilities
We will need that for ipsec verbs.

Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:54:30 -08:00
Aviad Yehezkel
65802f4800 net/mlx5: IPSec, Add command V2 support
This patch adds V2 command support.
New fpga devices support extended features (udp encap, esn etc...), this
features require new hardware sadb format therefore we have a new version
of commands to manipulate it.

Signed-off-by: Yossef Efraim <yossefe@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:53:18 -08:00
Yossi Kuperman
788a821076 net/mlx5e: IPSec, Add support for ESP trailer removal by hardware
Current hardware decrypts and authenticates incoming ESP packets.
Subsequently, the software extracts the nexthdr field, truncates the
trailer and adjusts csum accordingly.

With this patch and a capable device, the trailer is being removed
by the hardware and the nexthdr field is conveyed via PET. This way
we avoid both the need to access the trailer (cache miss) and to
compute its relative checksum, which significantly improve
the performance.

Experiment shows that trailer removal improves the performance by
2Gbps, (netperf). Both forwarding and host-to-host configurations.

Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:53:18 -08:00
Yossi Kuperman
581fdddee4 net/mlx5: IPSec, Generalize sandbox QP commands
The current code assume only SA QP commands.
Refactor in order to pave the way for new QP commands:
1. Generic cmd response format.
2. SA cmd checks are in dedicated functions.
3. Aligned debug prints.

Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-07 15:53:17 -08:00
Yishai Hadas
d50a8a96ee IB/mlx4: Move mlx4_uverbs_ex_query_device_resp to include/uapi/
This struct is involved in the user API for mlx4 and should not be hidden
inside a driver header file.

Fixes: 09d208b258 ("IB/mlx4: Add report for RSS capabilities by vendor channel")
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-07 16:10:07 -07:00
Doug Ledford
1abb791fcd Merge tag 'mlx5-updates-2018-02-28-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-28-1 (IPSec-1)

This series consists of some fixes and refactors for the mlx5 drivers,
especially around the FPGA and flow steering. Most of them are trivial
fixes and are the foundation of allowing IPSec acceleration from user-space.

We use flow steering abstraction in order to accelerate IPSec packets.
When a user creates a steering rule, [s]he states that we'll carry an
encrypt/decrypt flow action (using a specific configuration) for every
packet which conforms to a certain match. Since currently offloading these
packets is done via FPGA, we'll add another set of flow steering ops.
These ops will execute the required FPGA commands and then call the
standard steering ops.

In order to achieve this, we need that the commands will get all the
required information. Therefore, we pass the fte object and embed the
flow_action struct inside the fte. In addition, we add the shim layer
that will later be used for alternating between the standard and the
FPGA steering commands.

Some fixes, like " net/mlx5e: Wait for FPGA command responses with a timeout"
are very relevant for user-space applications, as these applications could
be killed, but we still want to wait for the FPGA and update the kernel's
database.

Regards,
Aviad and Matan

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:56:39 -07:00
Boris Pismenny
3346c48737 {net,IB}/mlx5: Add flow steering helpers
Add helper functions that check if a protocol is
part of a flow steering match criteria.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-06 22:20:14 -08:00
Aviad Yehezkel
5f4183781a net/mlx5: Add empty egress namespace to flow steering core
Currently, we don't support egress flow steering namespace in mlx5
flow steering core implementation. However, when we want to encrypt
a packet, we model it as a flow steering rule in the egress path.
To overcome this, we add an empty egress namespace to flow steering.
This namespace is initialized only when ipsec support exists.
In the future, this will grow to a full blown full steering
implementation, resembling the ingress path.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-06 22:20:13 -08:00
Matan Barak
a9db0ecf15 {net,IB}/mlx5: Add has_tag to mlx5_flow_act
The has_tag member will indicate whether a tag action was specified
in flow specification.

A flow tag 0 = MLX5_FS_DEFAULT_FLOW_TAG is assumed a valid flow tag
that is currently used by mlx5 RDMA driver, whereas in HW flow_tag = 0
means that the user doesn't care about flow_tag.  HW always provide
a flow_tag = 0 if all flow tags requested on a specific flow are 0.

So we need a way (in the driver) to differentiate between a user really
requesting flow_tag = 0 and a user who does not care, in order to be
able to report conflicting flow tags on a specific flow.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-06 22:06:33 -08:00
Doug Ledford
1d1ab1ae69 Merge tag 'mlx5-updates-2018-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-update-2018-02-23 (IB representors)

From: Mark Bloch <markb@mellanox.com>
=========
Add IB representor when in switchdev mode

The following series adds support for an IB (RAW Ethernet only) device
representor which is created when the user switches to switchdev mode.

Today when switching to switchdev mode the only representors which are
created are net devices. Each netdev is a representor of a virtual
function and any data sent via the representor is received on the virtual
function, and any data sent via the virtual function is received by the
representor.

For the mlx5 driver the main use of this functionality is to be able to
use Open vSwitch on the hypervisor in order to manage/control traffic
from/to the virtual functions. Open vSwitch can also work with  DPDK
devices and not just net devices, this series exposes an IB device, which
Mellanox PMD driver uses, which then can be used by Open vSwitch DPDK.

An IB device representor exposes only RAW Ethernet QP capabilities and
the ability to create flow rules to direct traffic to its RX queues. The
state of the IB device (ACTIVE/DOWN etc..) is based on the state of the
corresponding net device representor. No other RDMA/RoCE functionality is
currently supported and no GID table is exposed.
=========

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-28 13:37:39 -05:00
Mark Bloch
5e65b02c00 net/mlx5: E-Switch, Add definition of IB representor
Create a new representor type: REP_IB. which will be initialized by an IB
device that is used as a logical representor of a eswitch vport (VF or
uplink) just like we have a net device today in switchdev mode.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:38 -08:00
Mark Bloch
57cbd893c4 net/mlx5: E-Switch, Move representors definition to a global scope
In preparation for IB representors, move representors structs to a global
scope, also expose functions needed for registration, unregistration,
eswitch mode and creating a flow rule to direct traffic from SQs to the
right VF.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-23 12:36:38 -08:00
Leon Romanovsky
372e15c5db RDMA/uverbs: Reduce number of command header flags checks
Simplify the code by directly checking the availability of extended
command flog instead of doing multiple shift operations.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:31:18 -05:00
Doug Ledford
4bb46608ed Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
There is a 14 patch series waiting to come into for-next that has a
dependecy on code submitted into this kernel's for-rc series.  So, merge
the for-rc branch into the current for-next in order to make the patch
series apply cleanly.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 22:27:20 -05:00
Doug Ledford
f76a5c75d9 Merge tag 'mlx5-updates-2018-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-21

This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.

By Saeed,
First six patches of the series are meant to address a performance issue
and should provide a performance boost for multi core IRQ interrupt hungry
workloads.  The issue is fixed in the first patch, all other patches are
meant to refactor the code in light of this fix.

The problem it comes to fix, is a shared spinlock accessed across all HCA
IRQs which protects the CQ database.  To solve this we simply move the CQ
database and its spinlock to be per EQ (IRQ), thus per core.

By Yonatan,
Fragmented completion queue (CQ) for RDMA,
core driver implementation to create fragmented CQ buffers rather than
one large contiguous memory buffer, the implementation scheme already
exist and used by the netdev CQs, the patch shares that code with the
rdma CQ creation flow and makes use of the new API in mlx5_ib driver.

Thanks,
Saeed.

Merged into rdma-next tree as well as net-next tree to prevent conflicts
in future patches between the two trees.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-02-22 20:52:28 -05:00
Jason Gunthorpe
5d2beb576d IB/uverbs: Use __aligned_u64 for uapi headers
This has no impact on the structure layout since these structs already
have their u64s already properly aligned, but it does document that we
have this requirement for 32 bit compatibility.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:45 -07:00
Jason Gunthorpe
2f36028ce9 IB/uverbs: Use u64_to_user_ptr() not a union
The union approach will get the endianness wrong sometimes if the kernel's
pointer size is 32 bits resulting in EFAULTs when trying to copy to/from
user.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:45 -07:00
Matan Barak
89d9e8d3f1 IB/uverbs: Always use the attribute size provided by the user
This fixes several bugs around the copy_to/from user path:
 - copy_to used the user provided size of the attribute
   and could copy data beyond the end of the kernel buffer into
   userspace.
 - copy_from didn't know the size of the kernel buffer and
   could have left kernel memory unexpectedly un-initialized.
 - copy_from did not use the user length to determine if the
   attribute data is inlined or not.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:44 -07:00
Leon Romanovsky
415bb699d7 RDMA/restrack: Remove unimplemented XRCD object
Resource tracking of XRCD objects is not implemented in current
version of restrack and hence can be removed.

Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-15 14:59:44 -07:00
Yonatan Cohen
388ca8be00 IB/mlx5: Implement fragmented completion queue (CQ)
The current implementation of create CQ requires contiguous
memory, such requirement is problematic once the memory is
fragmented or the system is low in memory, it causes for
failures in dma_zalloc_coherent().

This patch implements new scheme of fragmented CQ to overcome
this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
to allocate fragmented buffers, rather than contiguous ones.

Base the Completion Queues (CQs) on this new fragmented buffer.

It fixes following crashes:
kworker/29:0: page allocation failure: order:6, mode:0x80d0
CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
[<>] dump_stack+0x19/0x1b
[<>] warn_alloc_failed+0x110/0x180
[<>] __alloc_pages_slowpath+0x6b7/0x725
[<>] __alloc_pages_nodemask+0x405/0x420
[<>] dma_generic_alloc_coherent+0x8f/0x140
[<>] x86_swiotlb_alloc_coherent+0x21/0x50
[<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
[<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
[<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
[<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
[<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
[<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]

Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-15 00:30:03 -08:00
Saeed Mahameed
3ec5693b17 net/mlx5: Remove redundant EQ API exports
EQ structure and API is private to mlx5_core driver only, external
drivers should not have access or the means to manipulate EQ objects.

Remove redundant exports and move API functions out of the linux/mlx5
include directory into the driver's mlx5_core.h private include file.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
2018-02-15 00:30:02 -08:00
Saeed Mahameed
3ac7afdbcf net/mlx5: Move CQ completion and event forwarding logic to eq.c
Since CQ tree is now per EQ, CQ completion and event forwarding became
specific implementation of EQ logic, this patch moves that logic to eq.c
and makes those functions static.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
2018-02-15 00:30:02 -08:00
Saeed Mahameed
f105b45bf7 net/mlx5: CQ hold/put API
Now as the CQ table is per EQ, add an API to hold/put CQ to be used from
eq.c in downstream patch.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
2018-02-15 00:30:01 -08:00
Saeed Mahameed
02d92f7903 net/mlx5: CQ Database per EQ
Before this patch the driver had one CQ database protected via one
spinlock, this spinlock is meant to synchronize between CQ
adding/removing and CQ IRQ interrupt handling.

On a system with large number of CPUs and on a work load that requires
lots of interrupts, this global spinlock becomes a very nasty hotspot
and introduces a contention between the active cores, which will
significantly hurt performance and becomes a bottleneck that prevents
seamless cpu scaling.

To solve this we simply move the CQ database and its spinlock to be per
EQ (IRQ), thus per core.

Tested with:
system: 2 sockets, 14 cores per socket, hyperthreading, 2x14x2=56 cores
netperf command: ./super_netperf 200 -P 0 -t TCP_RR  -H <server> -l 30 -- -r 300,300 -o -s 1M,1M -S 1M,1M

WITHOUT THIS PATCH:
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft %steal  %guest  %gnice   %idle
Average:     all    4.32    0.00   36.15    0.09    0.00   34.02   0.00    0.00    0.00   25.41

Samples: 2M of event 'cycles:pp', Event count (approx.): 1554616897271
Overhead  Command          Shared Object                 Symbol
+   14.28%  swapper          [kernel.vmlinux]              [k] intel_idle
+   12.25%  swapper          [kernel.vmlinux]              [k] queued_spin_lock_slowpath
+   10.29%  netserver        [kernel.vmlinux]              [k] queued_spin_lock_slowpath
+    1.32%  netserver        [kernel.vmlinux]              [k] mlx5e_xmit

WITH THIS PATCH:
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    4.27    0.00   34.31    0.01    0.00   18.71    0.00    0.00    0.00   42.69

Samples: 2M of event 'cycles:pp', Event count (approx.): 1498132937483
Overhead  Command          Shared Object             Symbol
+   23.33%  swapper          [kernel.vmlinux]          [k] intel_idle
+    1.69%  netserver        [kernel.vmlinux]          [k] mlx5e_xmit

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
2018-02-15 00:29:54 -08:00
Jason Gunthorpe
7061f28d8a rxe: Do not use 'struct sockaddr' in a uapi header
Linux has two 'linux/socket.h' files - and only the one in the kernel
defines struct sockaddr - the user space one does not.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:31:35 -07:00
Jason Gunthorpe
5229f87efc RDMA: Do not used __packed in uapi headers
__packed is not available in linux/types.h, so we cannot use it in
the uapi headers.

The construction struct ABC {} __packed; may still compile even if
__packed is not defined, however it simply creates a variable called
__packed, and doesn't set the alignment.

All these uses of packed are on structs that already have aligned
members.

While use in hfi may indicate the struct itself is unaligned,
the use in ocrdma is on a UHW struct which should never be unaligned,
so just delete it there.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-02-14 16:31:35 -07:00
Al Viro
7a163b2195 unify {de,}mangle_poll(), get rid of kernel-side POLL...
except, again, POLLFREE and POLL_BUSY_LOOP.

With this, we finally get to the promised end result:

 - POLL{IN,OUT,...} are plain integers and *not* in __poll_t, so any
   stray instances of ->poll() still using those will be caught by
   sparse.

 - eventpoll.c and select.c warning-free wrt __poll_t

 - no more kernel-side definitions of POLL... - userland ones are
   visible through the entire kernel (and used pretty much only for
   mangle/demangle)

 - same behavior as after the first series (i.e. sparc et.al. epoll(2)
   working correctly).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:37:22 -08:00
Linus Torvalds
a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Linus Torvalds
ee5daa1361 Merge branch 'work.poll2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more poll annotation updates from Al Viro:
 "This is preparation to solving the problems you've mentioned in the
  original poll series.

  After this series, the kernel is ready for running

      for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
            L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
            for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
      done

  as a for bulk search-and-replace.

  After that, the kernel is ready to apply the patch to unify
  {de,}mangle_poll(), and then get rid of kernel-side POLL... uses
  entirely, and we should be all done with that stuff.

  Basically, that's what you suggested wrt KPOLL..., except that we can
  use EPOLL... instead - they already are arch-independent (and equal to
  what is currently kernel-side POLL...).

  After the preparations (in this series) switch to returning EPOLL...
  from ->poll() instances is completely mechanical and kernel-side
  POLL... can go away. The last step (killing kernel-side POLL... and
  unifying {de,}mangle_poll() has to be done after the
  search-and-replace job, since we need userland-side POLL... for
  unified {de,}mangle_poll(), thus the cherry-pick at the last step.

  After that we will have:

   - POLL{IN,OUT,...} *not* in __poll_t, so any stray instances of
     ->poll() still using those will be caught by sparse.

   - eventpoll.c and select.c warning-free wrt __poll_t

   - no more kernel-side definitions of POLL... - userland ones are
     visible through the entire kernel (and used pretty much only for
     mangle/demangle)

   - same behavior as after the first series (i.e. sparc et.al. epoll(2)
     working correctly)"

* 'work.poll2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  annotate ep_scan_ready_list()
  ep_send_events_proc(): return result via esed->res
  preparation to switching ->poll() to returning EPOLL...
  add EPOLLNVAL, annotate EPOLL... and event_poll->event
  use linux/poll.h instead of asm/poll.h
  xen: fix poll misannotation
  smc: missing poll annotations
2018-02-11 13:57:19 -08:00