This patch continues the effort to refactor workqueue APIs, which began
with the changes that introduced new workqueues and a new alloc_workqueue flag:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually make workqueues unbound
by default, so that their workload placement can be optimized by the
scheduler.
Before that can happen, workqueue users must be converted to the
better-named new workqueues, with no intended behavior change:
system_wq -> system_percpu_wq
system_unbound_wq -> system_dfl_wq
This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.
This specific driver already allocates an unbound workqueue named "rxe_wq",
so replace system_unbound_wq with that queue rather than with system_dfl_wq.
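A minimal sketch of the kind of change at a queue_work() call site (the
work item shown is illustrative, not necessarily the actual call site):

  /* before: work queued on the global unbound workqueue */
  queue_work(system_unbound_wq, &qp->cleanup_work);

  /* after: reuse the driver's own unbound workqueue instead of system_dfl_wq */
  queue_work(rxe_wq, &qp->cleanup_work);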
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20260318152748.837388-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
For GEN3 and higher, FW requires scratch buffers for bookkeeping
during cleanup, specifically during QP and MR destroy ops.
Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Use the new API to support importing pinned dmabufs from exporters
that require revocation, such as VFIO. The revoke semantic is
achieved by issuing a HW invalidation command but not freeing
the key. This prevents further accesses to the region (they will
result in an invalid key AE), but also keeps the key reserved
until the region is actually deregistered (i.e., ibv_dereg_mr)
so that a new MR registration cannot acquire the same key.
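A rough sketch of such a revoke path (every function and field name
below is illustrative; only the behavior follows this patch):

  static void drv_mr_revoke(struct ib_umem_dmabuf *umem_dmabuf, void *priv)
  {
          struct drv_mr *mr = priv;

          /* Invalidate the HW mapping so further accesses to the region
           * fail with an invalid key asynchronous error. */
          drv_invalidate_mkey(mr->dev, mr->key);

          /* Deliberately keep mr->key allocated: it stays reserved until
           * dereg_mr so a new registration cannot be handed the same key. */
          mr->revoked = true;
  }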
Tested with lockdep+kasan and a memfd backed dmabuf.
The rereg_mr path is explicitly blocked in libibverbs for dmabuf MRs
(more specifically, any MR not of type IBV_MR_TYPE_MR), so the rereg_mr
path for dmabufs was tested with a modified libibverbs.
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260305170826.3803155-6-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Added helpers to acquire and release the umem dmabuf revoke
lock. The intent is to avoid the need for drivers to peek
into the ib_umem_dmabuf internals to get the dma_resv_lock
and bring us one step closer to abstracting ib_umem_dmabuf
away from drivers in general.
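A minimal sketch of what such helpers could look like, assuming they
simply wrap the attachment's reservation lock (helper names and bodies
here are an assumption, not the final API):

  static inline void ib_umem_dmabuf_revoke_lock(struct ib_umem_dmabuf *umem_dmabuf)
  {
          dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
  }

  static inline void ib_umem_dmabuf_revoke_unlock(struct ib_umem_dmabuf *umem_dmabuf)
  {
          dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
  }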
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260305170826.3803155-5-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Added an interface for importing a pinned but revocable dmabuf.
This interface can be used by drivers that are capable of revocation
so that they can import dmabufs from exporters that may require it,
such as VFIO.
This interface implements a two-step process, where drivers will first
call ib_umem_dmabuf_get_pinned_revocable_and_lock() which will pin and
map the dmabuf (and provide a functional move_notify/invalidate_mappings
callback), but will return with the lock still held so that the
driver can then populate the callback via
ib_umem_dmabuf_set_revoke_locked() without races from concurrent
revocations. This scheme also allows for easier integration with drivers
that may not have actually allocated their internal MR objects at the time
of the get_pinned_revocable* call.
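A sketch of how a driver might use the two-step flow (only the two
function names above come from this patch; argument lists, the MR
helper and the unlock call are illustrative):

  umem_dmabuf = ib_umem_dmabuf_get_pinned_revocable_and_lock(ibdev, offset, length,
                                                             fd, access_flags);
  if (IS_ERR(umem_dmabuf))
          return ERR_CAST(umem_dmabuf);

  mr = drv_alloc_mr(pd, access_flags);    /* driver MR object, created while locked */

  /* Register the revoke callback without racing a concurrent revocation. */
  ib_umem_dmabuf_set_revoke_locked(umem_dmabuf, drv_mr_revoke, mr);
  ib_umem_dmabuf_unlock(umem_dmabuf);     /* illustrative unlock helper */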
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260305170826.3803155-4-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
The refactoring is eventually going to make alloc_workqueue() unbound
by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
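For example (workqueue name and the other flags below are illustrative):

  /* before: implicitly per-CPU */
  wq = alloc_workqueue("foo_wq", WQ_MEM_RECLAIM, 0);

  /* after: same behavior, now requested explicitly */
  wq = alloc_workqueue("foo_wq", WQ_MEM_RECLAIM | WQ_PERCPU, 0);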
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20260305154117.326472-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This patch adds support for application-allocated memory for CQs.
The application allocates and manages the CQs directly. To support
this, the driver exports a new comp_mask to indicate direct control
of the CQ. When this comp_mask bit is set in the ureq, the driver
maps this application-allocated CQ memory into hardware. Since the
application manages this memory, the CQ depth ('cqe') passed by it
must be used as-is and the driver shouldn't update it.
For CQs, ib_core supports pinning dmabuf-based application memory,
specified through provider attributes. This umem is managed by
ib_core and is available in ib_cq. Register the 'create_cq_user'
devop to process this umem. The driver also supports the legacy
interface that allocates the umem internally.
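A rough sketch of the intended create-CQ flow (the comp_mask bit and
the field names below are assumptions for illustration, not the actual
uapi names):

  if (ureq.comp_mask & BNXT_RE_CQ_REQ_MASK_APP_CQ_MEM) {
          /* Application manages the CQ memory: use the umem that ib_core
           * pinned (available in the ib_cq) and keep 'cqe' as requested. */
          umem = ibcq->umem;
  } else {
          /* Legacy path: the driver pins the application VA itself. */
          umem = ib_umem_get(&rdev->ibdev, ureq.cq_va, bytes,
                             IB_ACCESS_LOCAL_WRITE);
  }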
Link: https://patch.msgid.link/r/20260302110036.36387-7-sriharsha.basavapatna@broadcom.com
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Some applications may need multiple doorbells to support parallel
processing of threads that each operate on a group of resources.
The following uapi methods have been implemented in this patch (a
sketch of their declarations follows the list).
- BNXT_RE_METHOD_DBR_ALLOC:
This allows the application to create extra doorbell regions, use the
associated doorbell page index in CREATE_QP, and use the associated
DB address while ringing the doorbell.
- BNXT_RE_METHOD_DBR_FREE:
Free the allocated doorbell region.
- BNXT_RE_METHOD_GET_DEFAULT_DBR:
Return the default doorbell page index and doorbell page address
associated with the ucontext.
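A hedged sketch of what the method declarations could look like (only
the method names come from this patch; the object and attribute IDs
and the attribute layouts are illustrative):

  DECLARE_UVERBS_NAMED_METHOD(
          BNXT_RE_METHOD_DBR_ALLOC,
          UVERBS_ATTR_IDR(BNXT_RE_DBR_ALLOC_HANDLE, BNXT_RE_OBJECT_DBR,
                          UVERBS_ACCESS_NEW, UA_MANDATORY),
          UVERBS_ATTR_PTR_OUT(BNXT_RE_DBR_ALLOC_PAGE_IDX,
                              UVERBS_ATTR_TYPE(u32), UA_MANDATORY),
          UVERBS_ATTR_PTR_OUT(BNXT_RE_DBR_ALLOC_OFFSET,
                              UVERBS_ATTR_TYPE(u64), UA_MANDATORY));

  DECLARE_UVERBS_NAMED_METHOD_DESTROY(
          BNXT_RE_METHOD_DBR_FREE,
          UVERBS_ATTR_IDR(BNXT_RE_DBR_FREE_HANDLE, BNXT_RE_OBJECT_DBR,
                          UVERBS_ACCESS_DESTROY, UA_MANDATORY));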
Link: https://patch.msgid.link/r/20260302110036.36387-4-sriharsha.basavapatna@broadcom.com
Co-developed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Inside bnxt_qplib_create_qp(), the driver currently does a lot of
things, such as allocating HWQ memory for SQ/RQ/ORRQ/IRRQ and
initializing a few of the qplib_qp fields.
Refactored the code such that all HWQ memory allocation has been
moved to the bnxt_re_init_qp_attr() function, while
bnxt_qplib_create_qp() now just initializes the request structure and
issues the HWRM command to firmware.
Introduced a couple of new functions, bnxt_re_setup_qp_hwqs() and
bnxt_re_setup_qp_swqs(), and moved the HWQ and SWQ memory allocation
logic there.
Link: https://patch.msgid.link/r/20260302110036.36387-3-sriharsha.basavapatna@broadcom.com
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This flag can be set by drivers once they have finished auditing and
implementing the full udata support on every udata operation.
My intention going forward is that driver authors proposing new udata uAPI
for their drivers must first do the work and set this flag.
If this flag is not set, userspace should not try to use udata-based
uAPI newer than this commit, though on a case-by-case basis it may be
OK depending on what checks historical kernels performed on the
specific call.
Since bnxt_re is audited now, it is the first driver to set the flag.
Link: https://patch.msgid.link/r/13-v3-bd56dd443069+49-bnxt_re_uapi_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a new function to consolidate the required compatibility pattern
for driver data: checking against a minimum size, and checking that
any unknown trailing bytes are zero.
This new function uses the faster copy_struct_from_user() instead of
directly checking for zero.
Incorporate the common ibdev_dbg() logging directly into the error paths
of the helper.
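A minimal sketch of such a helper, assuming it takes the ib_udata
directly (the name and exact signature here are illustrative):

  static int drv_copy_from_udata(void *dst, size_t dst_size,
                                 struct ib_udata *udata, size_t min_size,
                                 struct ib_device *ibdev)
  {
          int ret;

          if (udata->inlen < min_size) {
                  ibdev_dbg(ibdev, "udata too small (%zu < %zu)\n",
                            udata->inlen, min_size);
                  return -EINVAL;
          }

          /* copy_struct_from_user() copies min(dst_size, inlen) bytes,
           * zero-fills the rest of dst, and fails with -E2BIG if any
           * unknown trailing bytes are non-zero. */
          ret = copy_struct_from_user(dst, dst_size, udata->inbuf, udata->inlen);
          if (ret)
                  ibdev_dbg(ibdev, "invalid udata (%d)\n", ret);
          return ret;
  }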
Link: https://patch.msgid.link/r/3-v3-bd56dd443069+49-bnxt_re_uapi_jgg@nvidia.com
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend the VAR allocation UAPI to accept an optional flags attribute,
allowing userspace to request TLP VAR allocation via the
MLX5_IB_UAPI_VAR_ALLOC_FLAG_TLP flag.
When the TLP flag is specified, the driver selects the TLP VAR region
for allocation instead of the regular VirtIO VAR region.
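A minimal sketch of the region selection, assuming the VAR table
carries both regions (field and helper names are illustrative):

  struct mlx5_var_region *region;

  if (alloc_flags & MLX5_IB_UAPI_VAR_ALLOC_FLAG_TLP)
          region = &var_table->tlp_region;        /* TLP response gateways */
  else
          region = &var_table->virtio_region;     /* default VirtIO emulation VARs */

  page_idx = mlx5_var_region_alloc(var_table, region);   /* illustrative helper */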
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Add support for TLP (Transaction Layer Packet) VAR regions used by
software-defined device emulation. TLP VAR provides dedicated response
gateways for sending TLP responses back to the host in TLP emulation
scenarios.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Extract mlx5_var_region struct from mlx5_var_table to enable
supporting multiple VAR regions in VAR table, which will be used in
the upcoming patches (Virtio emulation VAR and TLP emulation VAR).
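A hedged sketch of the split (the actual field layout may differ):

  struct mlx5_var_region {
          unsigned long *bitmap;          /* allocation state of this region */
          u64 hw_start_addr;
          u64 num_var_hw_entries;
  };

  struct mlx5_var_table {
          struct mutex bitmap_lock;       /* serializes allocations */
          u32 stride_size;
          struct mlx5_var_region virtio;  /* existing VirtIO emulation VARs */
          struct mlx5_var_region tlp;     /* TLP VARs, added by later patches */
  };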
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
This series adds support for Transaction Layer Packet (TLP) emulation
response gateway regions, enabling userspace device emulation software
to write TLP responses directly to lower layers without kernel driver
involvement.
Currently, the mlx5 driver exposes VirtIO emulation access regions via
the MLX5_IB_METHOD_VAR_OBJ_ALLOC ioctl. This series extends that
ioctl to also support allocating TLP response gateway channels for
PCI device emulation use cases.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce the hardware structures and definitions needed for the driver
support of TLP emulation in mlx5_ifc.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Allow users to set the FRMR pools aging timer through netlink.
This functionality allows users to control how long handles reside in
the kernel before being destroyed, and thus to tune the tradeoff
between memory and HW object consumption on one hand and memory
registration optimization on the other.
Since FRMR pools are highly beneficial in application restart
scenarios, this command lets users match the aging timer to their
application restart time, making sure the FRMR handles deregistered
on application teardown are kept in the pools long enough to be
reused at application startup.
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260226-frmr_pools-v4-9-95360b54f15e@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Use the new generic FRMR pools mechanism to optimize the performance of
memory registrations.
The move to the new generic FRMR pools will allow users who configure
the MR cache through debugfs to use the netlink API for FRMR pools,
which is added later in this series. This gives more flexibility in
configuring the kernel and also makes configuration possible on
machines where debugfs is not available.
mlx5_ib will save the mkey index as the handle in the FRMR pools, the
same as the MR cache implementation did.
Upon each memory registration mlx5_ib will try to pull a handle from
the FRMR pools, and upon each deregistration it will push the handle
back to its appropriate pool.
Use the vendor key field in the UMR pool key to save the access mode
of the mkey.
Use the option for a kernel-only FRMR pool to manage the mkeys used
for registration with DMAH, as the translation between the DMAH UAPI
and the mkey property of st_index is non-trivial and changes
dynamically.
Since the value for no PH is 0xff and not zero, switch between them
in the frmr_key to have a zeroed kernel_vendor_key when not using
DMAH.
Remove the limitation the MR cache had of mkeys covering up to 2^20
DMA blocks, and support mkeys up to the HW limitations according to
the caps.
Remove all MR cache related code.
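A rough sketch of the resulting reg/dereg fast path (the generic pool
helper names below are assumptions, not the actual FRMR pools API):

  /* reg_mr: pull a handle (the mkey index) from the matching pool */
  if (rdma_frmr_pool_pop(pool, &frmr_key, &mkey_index))
          mkey_index = mlx5_create_new_mkey(dev, &frmr_key);      /* pool miss */

  /* dereg_mr: push the handle back to its pool for later reuse */
  rdma_frmr_pool_push(pool, &frmr_key, mkey_index);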
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260226-frmr_pools-v4-6-95360b54f15e@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>