In the original rxe implementation it was intended to use a common object
to represent MRs and MWs but they are different enough to separate these
into two objects.
This allows replacing the mem name with mr for MRs which is more
consistent with the style for the other objects and less likely to be
confusing. This is a long patch that mostly changes mem to mr where it
makes sense and adds a new rxe_mw struct.
Link: https://lore.kernel.org/r/20210325212425.2792-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The device is got by isert_device_get() with refcount is 1, and is
assigned to isert_conn by
isert_conn->device = device.
When isert_create_qp() failed, device will be freed with
isert_device_put().
Later, the device is used in isert_free_login_buf(isert_conn) by the
isert_conn->device->ib_device statement.
Free the device in the correct order.
Fixes: ae9ea9ed38 ("iser-target: Split some logic in isert_connect_request to routines")
Link: https://lore.kernel.org/r/20210322161325.7491-1-lyl2019@mail.ustc.edu.cn
Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, ODP caps are set during the init stage of mlx5_ib_dev,
regardless of whether the device profile supports ODP or not. There is no
point in setting ODP caps if the device profile doesn't support
ODP. Hence, move setting the ODP caps to the odp_init stage.
Link: https://lore.kernel.org/r/20210318135259.681264-1-leon@kernel.org
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Initial drop action support missed that drop action can be added to egress
flow tables as well. Add the missing support.
This requires making sure that dest_type isn't set to PORT which in turn
exposes a possibility of passing dst while indicating number of dsts as
zero. Explicitly check for number of dsts and pass the appropriate
pointer.
Fixes: f29de9eee7 ("RDMA/mlx5: Add support for drop action in DV steering")
Link: https://lore.kernel.org/r/20210318135123.680759-1-leon@kernel.org
Reviewed-by: Mark Bloch <markb@nvidia.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Current code uses many different types when dealing with a port of a RDMA
device: u8, unsigned int and u32. Switch to u32 to clean up the logic.
This allows us to make (at least) the core view consistent and use the
same type. Unfortunately not all places can be converted. Many uverbs
functions expect port to be u8 so keep those places in order not to break
UAPIs. HW/Spec defined values must also not be changed.
With the switch to u32 we now can support devices with more than 255
ports. U32_MAX is reserved to make control logic a bit easier to deal
with. As a device with U32_MAX ports probably isn't going to happen any
time soon this seems like a non issue.
When a device with more than 255 ports is created uverbs will report the
RDMA device as having 255 ports as this is the max currently supported.
The verbs interface is not changed yet because the IBTA spec limits the
port size in too many places to be u8 and all applications that relies in
verbs won't be able to cope with this change. At this stage, we are
extending the interfaces that are using vendor channel solely
Once the limitation is lifted mlx5 in switchdev mode will be able to have
thousands of SFs created by the device. As the only instance of an RDMA
device that reports more than 255 ports will be a representor device and
it exposes itself as a RAW Ethernet only device CM/MAD/IPoIB and other
ULPs aren't effected by this change and their sysfs/interfaces that are
exposes to userspace can remain unchanged.
While here cleanup some alignment issues and remove unneeded sanity
checks (mainly in rdmavt),
Link: https://lore.kernel.org/r/20210301070420.439400-1-leon@kernel.org
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is no need to create the ODP EQ if the user doesn't use ODP MRs.
Hence, create it only when the first ODP MR is created. This EQ will be
destroyed only when the device is unloaded.
This will decrease the number of EQs created per device. for example: If
we creates 1K devices (SF/VF/etc'), than we will decrease the num of EQs
by 1K.
Link: https://lore.kernel.org/r/20210314125418.179716-1-leon@kernel.org
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This patch fixes bunch of kernel-doc compilation warnings like below:
drivers/infiniband/hw/i40iw/i40iw_cm.c:4372: warning: expecting prototype for i40iw_ifdown_notify(). Prototype was for i40iw_if_notify() instead
Link: https://lore.kernel.org/r/20210314133908.291945-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The umem DMA list calculation was locked at 4k pages due to confusion
around how this API works and is used when larger pages are present.
The conclusion is:
- umem's cannot extend past what is mapped into the process, so creating
a lage page size and referring to a sub-range is not allowed
- umem's must always have a page offset of zero, except for sub PAGE_SIZE
umems
- The feature of umem_offset to create multiple objects inside a umem
is buggy and isn't used anyplace. Thus we can assume all users of the
current API have umem_offset == 0 as well
Provide a new page size calculator that limits the DMA list to the VA
range and enforces umem_offset == 0.
Allow user space to specify the page sizes which it can accept, this
bitmap must be derived from the intended use of the umem, based on
per-usage HW limitations.
Link: https://lore.kernel.org/r/20210304130501.1102577-4-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change uverbs_get_const/uverbs_get_const_default to work properly with
both signed/unsigned parameters.
Current APIs mix s64 and u64 which leads to incorrect check when u64
value was supplied and its upper bit was set. In that case
uverbs_get_const() / uverbs_get_const_default() lower bound check may
fail unexpectedly, target is unsigned (lower bound is 0) but value
became negative as of the s64 usage.
Split to have two different APIs, no change to callers as the required
API will be called internally according to the target type.
Link: https://lore.kernel.org/r/20210304130501.1102577-3-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The WARN_ON() issued as part of ib_umem_find_best_pgsz() blocked cases
when only page sizes larger than PAGE_SIZE were set, drop it to enable
those cases.
In addition, there is no need to have a specific check for zero
pgsz_bitmap, the function will do its job and return 0 at the end if
nothing match will be found.
Link: https://lore.kernel.org/r/20210304130501.1102577-2-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
mlx5_is_roce_enabled returns the devlink RoCE init value, therefore it
should be used only when driver is loaded. Instead we just need to read
the roce_en field.
In addition, rename mlx5_is_roce_enabled to mlx5_is_roce_init_enabled.
Fixes: 7a58779edd ("IB/mlx5: Improve query port for representor port")
Link: https://lore.kernel.org/r/20210304124517.1100608-2-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that this is only used in a few places in mr.c give it a sensible
name. It has nothing to do with the cache and can be invoked on any
MR. DMA is stopped and the user cannot touch the MR any further once it
completes.
Link: https://lore.kernel.org/r/20210304120745.1090751-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the SRCU stuff has been removed the entire MR destroy logic can
be made a lot simpler. Currently there are many different ways to destroy a
MR and it makes it really hard to do this task correctly. Route all
destruction through mlx5_ib_dereg_mr() and make it work for all
situations.
Since it turns out all the different MR types do basically the same thing
this removes a lot of knowledge of MR internals from ODP and leaves ODP
just exporting an operation to clean up children.
This fixes a few weird corner cases bugs and firmly uses the correct
ordering of the MR destruction:
- Stop parallel access to the mkey via the ODP xarray
- Stop DMA
- Release the umem
- Clean up ODP children
- Free/Recycle the MR
Link: https://lore.kernel.org/r/20210304120745.1090751-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The struct mlx5_ib_mr can be used for three different things, but only one
at a time:
- In the user MR cache
- As a kernel MR
- As a user MR
Overlay the three things into a single union with the following rules:
- If the mr is found on the cache_ent->head list then it is a cache MR
and umem == NULL. The entire union is zero after the MR is removed from
the cache.
- If umem != NULL or type == IB_MR_TYPE_USER then it is a user MR.
- If umem == NULL then it is a kernel MR
This reduces the size of struct mlx5_ib_mr to 552 bytes from 702.
The only place the three flows overlap in the code is during dereg, so add
a few extra checks along there.
Link: https://lore.kernel.org/r/20210304120745.1090751-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
All of the ODP code assumes when it calls mlx5_mr_cache_alloc() the ODP
related fields are zero'd. This is true if the MR was just allocated, but
if the MR is recycled through the cache then the values are never zero'd.
This causes a bug in the odp_stats, they don't reset when the MR is
reallocated, also is_odp_implicit is never 0'd.
So we can use memset on a block of the mlx5_ib_mr reorganize the structure
to put all the data that can be zero'd by the cache at the end.
It is organized as an anonymous struct because the next patch will make
this a union.
Delete the unused smr_info. Don't set the kernel only desc_size on the
user path. No longer any need to zero mr->parent before freeing it, the
memset() will get it now.
Fixes: a3de94e3d6 ("IB/mlx5: Introduce ODP diagnostic counters")
Link: https://lore.kernel.org/r/20210304120745.1090751-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull rdma fixes from Jason Gunthorpe:
"Nothing special here, though Bob's regression fixes for rxe would have
made it before the rc cycle had there not been such strong winter
weather!
- Fix corner cases in the rxe reference counting cleanup that are
causing regressions in blktests for SRP
- Two kdoc fixes so W=1 is clean
- Missing error return in error unwind for mlx5
- Wrong lock type nesting in IB CM"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rxe: Fix errant WARN_ONCE in rxe_completer()
RDMA/rxe: Fix extra deref in rxe_rcv_mcast_pkt()
RDMA/rxe: Fix missed IB reference counting in loopback
RDMA/uverbs: Fix kernel-doc warning of _uverbs_alloc
RDMA/mlx5: Set correct kernel-doc identifier
IB/mlx5: Add missing error code
RDMA/rxe: Fix missing kconfig dependency on CRYPTO
RDMA/cm: Fix IRQ restore in ib_send_cm_sidr_rep
Pull gcc-plugins fixes from Kees Cook:
"Tiny gcc-plugin fixes for v5.12-rc2. These issues are small but have
been reported a couple times now by static analyzers, so best to get
them fixed to reduce the noise. :)
- Fix coding style issues (Jason Yan)"
* tag 'gcc-plugins-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: latent_entropy: remove unneeded semicolon
gcc-plugins: structleak: remove unneeded variable 'ret'
Pull device mapper fixes from Mike Snitzer:
"Fix DM verity target's optional Forward Error Correction (FEC) for
Reed-Solomon roots that are unaligned to block size"
* tag 'for-5.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm verity: fix FEC for RS roots unaligned to block size
dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size
Pull block fixes from Jens Axboe:
- NVMe fixes:
- more device quirks (Julian Einwag, Zoltán Böszörményi, Pascal
Terjan)
- fix a hwmon error return (Daniel Wagner)
- fix the keep alive timeout initialization (Martin George)
- ensure the model_number can't be changed on a used subsystem
(Max Gurtovoy)
- rsxx missing -EFAULT on copy_to_user() failure (Dan)
- rsxx remove unused linux.h include (Tian)
- kill unused RQF_SORTED (Jean)
- updated outdated BFQ comments (Joseph)
- revert work-around commit for bd_size_lock, since we removed the
offending user in this merge window (Damien)
* tag 'block-5.12-2021-03-05' of git://git.kernel.dk/linux-block:
nvmet: model_number must be immutable once set
nvme-fabrics: fix kato initialization
nvme-hwmon: Return error code when registration fails
nvme-pci: add quirks for Lexar 256GB SSD
nvme-pci: mark Kingston SKC2000 as not supporting the deepest power state
nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST.
rsxx: Return -EFAULT if copy_to_user() fails
block/bfq: update comments and default value in docs for fifo_expire
rsxx: remove unused including <linux/version.h>
block: Drop leftover references to RQF_SORTED
block: revert "block: fix bd_size_lock use"
Pull io_uring fixes from Jens Axboe:
"A bit of a mix between fallout from the worker change, cleanups and
reductions now possible from that change, and fixes in general. In
detail:
- Fully serialize manager and worker creation, fixing races due to
that.
- Clean up some naming that had gone stale.
- SQPOLL fixes.
- Fix race condition around task_work rework that went into this
merge window.
- Implement unshare. Used for when the original task does unshare(2)
or setuid/seteuid and friends, drops the original workers and forks
new ones.
- Drop the only remaining piece of state shuffling we had left, which
was cred. Move it into issue instead, and we can drop all of that
code too.
- Kill f_op->flush() usage. That was such a nasty hack that we had
out of necessity, we no longer need it.
- Following from ->flush() removal, we can also drop various bits of
ctx state related to SQPOLL and cancelations.
- Fix an issue with IOPOLL retry, which originally was fallout from a
filemap change (removing iov_iter_revert()), but uncovered an issue
with iovec re-import too late.
- Fix an issue with system suspend.
- Use xchg() for fallback work, instead of cmpxchg().
- Properly destroy io-wq on exec.
- Add create_io_thread() core helper, and use that in io-wq and
io_uring. This allows us to remove various silly completion events
related to thread setup.
- A few error handling fixes.
This should be the grunt of fixes necessary for the new workers, next
week should be quieter. We've got a pending series from Pavel on
cancelations, and how tasks and rings are indexed. Outside of that,
should just be minor fixes. Even with these fixes, we're still killing
a net ~80 lines"
* tag 'io_uring-5.12-2021-03-05' of git://git.kernel.dk/linux-block: (41 commits)
io_uring: don't restrict issue_flags for io_openat
io_uring: make SQPOLL thread parking saner
io-wq: kill hashed waitqueue before manager exits
io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED return
io_uring: don't keep looping for more events if we can't flush overflow
io_uring: move to using create_io_thread()
kernel: provide create_io_thread() helper
io_uring: reliably cancel linked timeouts
io_uring: cancel-match based on flags
io-wq: ensure all pending work is canceled on exit
io_uring: ensure that threads freeze on suspend
io_uring: remove extra in_idle wake up
io_uring: inline __io_queue_async_work()
io_uring: inline io_req_clean_work()
io_uring: choose right tctx->io_wq for try cancel
io_uring: fix -EAGAIN retry with IOPOLL
io-wq: fix error path leak of buffered write hash map
io_uring: remove sqo_task
io_uring: kill sqo_dead and sqo submission halting
io_uring: ignore double poll add on the same waitqueue head
...
Pull power management fixes from Rafael Wysocki:
"These fix the usage of device links in the runtime PM core code and
update the DTPM (Dynamic Thermal Power Management) feature added
recently.
Specifics:
- Make the runtime PM core code avoid attempting to suspend supplier
devices before updating the PM-runtime status of a consumer to
'suspended' (Rafael Wysocki).
- Fix DTPM (Dynamic Thermal Power Management) root node
initialization and label that feature as EXPERIMENTAL in Kconfig
(Daniel Lezcano)"
* tag 'pm-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
powercap/drivers/dtpm: Add the experimental label to the option description
powercap/drivers/dtpm: Fix root node initialization
PM: runtime: Update device status before letting suppliers suspend
Pull ACPI fix from Rafael Wysocki:
"Make the empty stubs of some helper functions used when CONFIG_ACPI is
not set actually match those functions (Andy Shevchenko)"
* tag 'acpi-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: bus: Constify is_acpi_node() and friends (part 2)
Pull iommu fixes from Joerg Roedel:
- Fix a sleeping-while-atomic issue in the AMD IOMMU code
- Disable lazy IOTLB flush for untrusted devices in the Intel VT-d
driver
- Fix status code definitions for Intel VT-d
- Fix IO Page Fault issue in Tegra IOMMU driver
* tag 'iommu-fixes-v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix status code for Allocate/Free PASID command
iommu: Don't use lazy flush for untrusted device
iommu/tegra-smmu: Fix mc errors on tegra124-nyan
iommu/amd: Fix sleeping in atomic in increase_address_space()
Pull btrfs fixes from David Sterba:
"More regression fixes and stabilization.
Regressions:
- zoned mode
- count zone sizes in wider int types
- fix space accounting for read-only block groups
- subpage: fix page tail zeroing
Fixes:
- fix spurious warning when remounting with free space tree
- fix warning when creating a directory with smack enabled
- ioctl checks for qgroup inheritance when creating a snapshot
- qgroup
- fix missing unlock on error path in zero range
- fix amount of released reservation on error
- fix flushing from unsafe context with open transaction,
potentially deadlocking
- minor build warning fixes"
* tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: do not account freed region of read-only block group as zone_unusable
btrfs: zoned: use sector_t for zone sectors
btrfs: subpage: fix the false data csum mismatch error
btrfs: fix warning when creating a directory with smack enabled
btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
btrfs: export and rename qgroup_reserve_meta
btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
btrfs: fix spurious free_space_tree remount warning
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
btrfs: ref-verify: use 'inline void' keyword ordering
Pull devicetree fixes from Rob Herring:
- Another batch of graph and video-interfaces schema conversions
- Drop DT header symlink for dropped C6X arch
- Fix bcm2711-hdmi schema error
* tag 'devicetree-fixes-for-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: media: Use graph and video-interfaces schemas, round 2
dts: drop dangling c6x symlink
dt-bindings: bcm2711-hdmi: Fix broken schema
Pull tracing fixes from Steven Rostedt:
"Functional fixes:
- Fix big endian conversion for arm64 in recordmcount processing
- Fix timestamp corruption in ring buffer on discarding events
- Fix memory leak in __create_synth_event()
- Skip selftests if tracing is disabled as it will cause them to
fail.
Non-functional fixes:
- Fix help text in Kconfig
- Remove duplicate prototype for trace_empty()
- Fix stale comment about the trace_event_call flags.
Self test update:
- Add more information to the validation output of when a corrupt
timestamp is found in the ring buffer, and also trigger a warning
to make sure that tests catch it"
* tag 'trace-v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix comment about the trace_event_call flags
tracing: Skip selftests if tracing is disabled
tracing: Fix memory leak in __create_synth_event()
ring-buffer: Add a little more information and a WARN when time stamp going backwards is detected
ring-buffer: Force before_stamp and write_stamp to be different on discard
tracing: Fix help text of TRACEPOINT_BENCHMARK in Kconfig
tracing: Remove duplicate declaration from trace.h
ftrace: Have recordmcount use w8 to read relp->r_info in arm64_is_fake_mcount
In rxe_comp.c in rxe_completer() the function free_pkt() did not clear skb
which triggered a warning at 'done:' and could possibly at 'exit:'. The
WARN_ONCE() calls are not actually needed. The call to free_pkt() is
moved to the end to clearly show that all skbs are freed.
Fixes: 899aba891c ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()")
Link: https://lore.kernel.org/r/20210304192048.2958-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>