Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next

Jason Gunthorpe says:

====================
This series is the start of adding full DMABUF support to
iommufd. Currently it works only with VFIO's DMABUF exporter.
It sits on top of Leon's series adding a DMABUF exporter to VFIO:

   https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF FDs, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and, if the underlying alignment
requirements are met, it will be placed in the iommu_domain.

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI peer-to-peer support in VMs, and is the last feature of the VFIO
type 1 container that iommufd couldn't match.

The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.

Instead iommufd relies on revocable DMABUFs. Whenever VFIO decides there
should be no access to the MMIO it can shoot down the mapping in iommufd,
which will unmap it from the iommu_domain. There is no automatic remap;
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know when it is doing something that will revoke the DMABUF and
should map/unmap around the activity. E.g. when QEMU goes to issue FLR it
should do the unmap/map in iommufd.

Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.

The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().

I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.

The latest series for interconnect negotiation to exchange a phys_addr is:
 https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for design of revoke is here:
 https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
====================

Based on a shared branch with vfio.

* iommufd_dmabuf:
  iommufd/selftest: Add some tests for the dmabuf flow
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Allow MMIO pages in a batch
  iommufd: Allow a DMABUF to be revoked
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Add DMABUF to iopt_pages
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
  vfio/nvgrace: Support get_dmabuf_phys
  vfio/pci: Add dma-buf export support for MMIO regions
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Share the core device pointer while invoking feature functions
  vfio: Export vfio device get and put registration helpers
  dma-buf: provide phys_vec to scatter-gather mapping routine
  PCI/P2PDMA: Document DMABUF model
  PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Separate the mmap() support from the core logic

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jason Gunthorpe, 2025-11-25 12:46:52 -04:00
33 changed files with 1887 additions and 211 deletions


@@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.
One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine if a given Root Complex supports this or not.
(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
only supports doing P2P when the endpoints involved are all behind the
same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
hierarchy will be routable, but it does not require routing
between hierarchies.
For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
until they reach a host bridge or root port. If the path includes PCIe switches
then based on the ACS settings the transaction can route entirely within
the PCIe hierarchy and never reach the root port. The kernel will evaluate
the PCIe topology and always permit P2P in these well-defined cases.
The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent so there are
a few corner case gotchas with these pages so developers need to
be careful about what they do with them.
However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SOC to another
PCIe root port, or routed internally to the SOC.
The PCIe specification doesn't define the forwarding of transactions between
hierarchy domains, and the kernel defaults to blocking such routing. There is
an allowlist for detecting known-good HW, in which case P2P between any
two PCIe devices will be permitted.
Since P2P inherently is doing transactions between two devices it requires two
drivers to be co-operating inside the kernel. The providing driver has to convey
its MMIO to the consuming driver. To meet the driver model lifecycle rules the
MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
table mappings undone before the providing driver completes remove().
This requires the providing and consuming driver to actively work together to
guarantee that the consuming driver has stopped using the MMIO during a removal
cycle. This is done by either a synchronous invalidation shutdown or waiting
for all usage refcounts to reach zero.
At the lowest level the P2P subsystem offers a naked struct p2p_provider that
delegates lifecycle management to the providing driver. It is expected that
drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
to provide an invalidation shutdown. These MMIO addresses have no struct page, and
if used with mmap() must create special PTEs. As such there are very few
kernel uAPIs that can accept pointers to them; in particular they cannot be used
with read()/write(), including O_DIRECT.
Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
pgmap ensures that when the pgmap is destroyed all other drivers have stopped
using the MMIO. This option works with O_DIRECT flows, in some cases, if the
underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
it also relies on architecture support along with alignment and minimum size
limitations.
Driver Writer's Guide
@@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
Struct Page Caveats
-------------------
Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for it. At this time, the kernel
interfaces do not have any checks for ensuring this. This obviously
precludes passing these pages to userspace.
While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
KVA is still MMIO and must still be accessed through the normal
readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
like any other MMIO mapping. While this will actually work on some
architectures, others will experience corruption or just crash in the kernel.
Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
access happens.
Usage With DMABUF
=================
DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
Userspace can then pass the FD to an importing driver which will ask the
exporting driver to map it to the importer.
In this case the initiator and target pci_devices are known and the P2P subsystem
is used to determine the mapping type. The phys_addr_t-based DMA API is used to
establish the dma_addr_t.
Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
to remove() it must deliver an invalidation shutdown to all DMABUF importing
drivers through move_notify() and synchronously DMA unmap all the MMIO.
No importing driver can continue to have a DMA map to the MMIO after the
exporting driver has destroyed its p2p_provider.
P2P DMA Support Library


@@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
{
iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
iter->len = vec->len;
return true;
}


@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
dma-fence-unwrap.o dma-resv.o
dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o
obj-$(CONFIG_DMABUF_HEAPS) += heaps/
obj-$(CONFIG_SYNC_FILE) += sync_file.o


@@ -0,0 +1,248 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* DMA BUF Mapping Helpers
*
*/
#include <linux/dma-buf-mapping.h>
#include <linux/dma-resv.h>
static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
dma_addr_t addr)
{
unsigned int len, nents;
int i;
nents = DIV_ROUND_UP(length, UINT_MAX);
for (i = 0; i < nents; i++) {
len = min_t(size_t, length, UINT_MAX);
length -= len;
/*
* DMABUF abuses scatterlist to create a scatterlist
* that does not have any CPU list, only the DMA list.
* Always set the page related values to NULL to ensure
* importers can't use it. The phys_addr based DMA API
* does not require the CPU list for mapping or unmapping.
*/
sg_set_page(sgl, NULL, 0, 0);
sg_dma_address(sgl) = addr + i * UINT_MAX;
sg_dma_len(sgl) = len;
sgl = sg_next(sgl);
}
return sgl;
}
static unsigned int calc_sg_nents(struct dma_iova_state *state,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size)
{
unsigned int nents = 0;
size_t i;
if (!state || !dma_use_iova(state)) {
for (i = 0; i < nr_ranges; i++)
nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
} else {
/*
* In the IOVA case there is nominally one SG entry spanning
* the whole IOVA range, but each entry's length must fit in
* sg->length, so more entries may be needed.
*/
nents = DIV_ROUND_UP(size, UINT_MAX);
}
return nents;
}
/**
* struct dma_buf_dma - holds DMA mapping information
* @sgt: Scatter-gather table
* @state: DMA IOVA state relevant in IOMMU-based DMA
* @size: Total size of DMA transfer
*/
struct dma_buf_dma {
struct sg_table sgt;
struct dma_iova_state *state;
size_t size;
};
/**
* dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment
* from arrays of physical vectors. This function is intended for MMIO memory
* only.
* @attach: [in] attachment whose scatterlist is to be returned
* @provider: [in] p2pdma provider
* @phys_vec: [in] array of physical vectors
* @nr_ranges: [in] number of entries in phys_vec array
* @size: [in] total size of phys_vec
* @dir: [in] direction of DMA transfer
*
* Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
* on error. May return -EINTR if it is interrupted by a signal.
*
* On success, the DMA addresses and lengths in the returned scatterlist are
* PAGE_SIZE aligned.
*
* A mapping must be unmapped by using dma_buf_free_sgt().
*
* NOTE: This function is intended for exporters. If direct traffic routing is
* mandatory, the exporter should call pci_p2pdma_map_type() before calling
* this function.
*/
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
struct p2pdma_provider *provider,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size,
enum dma_data_direction dir)
{
unsigned int nents, mapped_len = 0;
struct dma_buf_dma *dma;
struct scatterlist *sgl;
dma_addr_t addr;
size_t i;
int ret;
if (WARN_ON(!attach || !attach->dmabuf || !provider))
/* This function is supposed to work on MMIO memory only */
return ERR_PTR(-EINVAL);
dma_resv_assert_held(attach->dmabuf->resv);
dma = kzalloc(sizeof(*dma), GFP_KERNEL);
if (!dma)
return ERR_PTR(-ENOMEM);
switch (pci_p2pdma_map_type(provider, attach->dev)) {
case PCI_P2PDMA_MAP_BUS_ADDR:
/*
* There is no need in IOVA at all for this flow.
*/
break;
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
if (!dma->state) {
ret = -ENOMEM;
goto err_free_dma;
}
dma_iova_try_alloc(attach->dev, dma->state, 0, size);
break;
default:
ret = -EINVAL;
goto err_free_dma;
}
nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
if (ret)
goto err_free_state;
sgl = dma->sgt.sgl;
for (i = 0; i < nr_ranges; i++) {
if (!dma->state) {
addr = pci_p2pdma_bus_addr_map(provider,
phys_vec[i].paddr);
} else if (dma_use_iova(dma->state)) {
ret = dma_iova_link(attach->dev, dma->state,
phys_vec[i].paddr, 0,
phys_vec[i].len, dir,
DMA_ATTR_MMIO);
if (ret)
goto err_unmap_dma;
mapped_len += phys_vec[i].len;
} else {
addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
phys_vec[i].len, dir,
DMA_ATTR_MMIO);
ret = dma_mapping_error(attach->dev, addr);
if (ret)
goto err_unmap_dma;
}
if (!dma->state || !dma_use_iova(dma->state))
sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
}
if (dma->state && dma_use_iova(dma->state)) {
WARN_ON_ONCE(mapped_len != size);
ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
if (ret)
goto err_unmap_dma;
sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
}
dma->size = size;
/*
* No CPU list is included; set orig_nents = 0 so importers can
* detect this via the SG table (use nents only).
*/
dma->sgt.orig_nents = 0;
/*
* sgl must now be NULL, indicating the previous entry was the last
* one and sg_alloc_table() allocated the correct number of entries.
*/
WARN_ON_ONCE(sgl);
return &dma->sgt;
err_unmap_dma:
if (!i || !dma->state) {
; /* Do nothing */
} else if (dma_use_iova(dma->state)) {
dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
DMA_ATTR_MMIO);
} else {
for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
dma_unmap_phys(attach->dev, sg_dma_address(sgl),
sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
}
sg_free_table(&dma->sgt);
err_free_state:
kfree(dma->state);
err_free_dma:
kfree(dma);
return ERR_PTR(ret);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");
/**
* dma_buf_free_sgt - unmaps the buffer
* @attach: [in] attachment to unmap buffer from
* @sgt: [in] scatterlist info of the buffer to unmap
* @dir: [in] direction of DMA transfer
*
* This unmaps a DMA mapping for @attach obtained
* by dma_buf_phys_vec_to_sgt().
*/
void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
enum dma_data_direction dir)
{
struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
int i;
dma_resv_assert_held(attach->dmabuf->resv);
if (!dma->state) {
; /* Do nothing */
} else if (dma_use_iova(dma->state)) {
dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
DMA_ATTR_MMIO);
} else {
struct scatterlist *sgl;
for_each_sgtable_dma_sg(sgt, sgl, i)
dma_unmap_phys(attach->dev, sg_dma_address(sgl),
sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
}
sg_free_table(sgt);
kfree(dma->state);
kfree(dma);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");


@@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
* as a bus address, __finalise_sg() will copy the dma
* address into the output segment.
*/
s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
sg_phys(s));
s->dma_address = pci_p2pdma_bus_addr_map(
p2pdma_state.mem, sg_phys(s));
sg_dma_len(s) = sg->length;
sg_dma_mark_bus_address(s);
continue;


@@ -8,8 +8,10 @@
* The datastructure uses the iopt_pages to optimize the storage of the PFNs
* between the domains and xarray.
*/
#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/file.h>
#include <linux/iommu.h>
#include <linux/iommufd.h>
#include <linux/lockdep.h>
@@ -284,6 +286,9 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
case IOPT_ADDRESS_FILE:
start = elm->start_byte + elm->pages->start;
break;
case IOPT_ADDRESS_DMABUF:
start = elm->start_byte + elm->pages->dmabuf.start;
break;
}
rc = iopt_alloc_iova(iopt, dst_iova, start, length);
if (rc)
@@ -468,25 +473,53 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
* @iopt: io_pagetable to act on
* @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
* the chosen iova on output. Otherwise is the iova to map to on input
* @file: file to map
* @fd: file descriptor of a memfd or DMABUF to map
* @start: map file starting at this byte offset
* @length: Number of bytes to map
* @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
* @flags: IOPT_ALLOC_IOVA or zero
*/
int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
unsigned long *iova, struct file *file,
unsigned long start, unsigned long length,
int iommu_prot, unsigned int flags)
unsigned long *iova, int fd, unsigned long start,
unsigned long length, int iommu_prot,
unsigned int flags)
{
struct iopt_pages *pages;
struct dma_buf *dmabuf;
unsigned long start_byte;
unsigned long last;
if (!length)
return -EINVAL;
if (check_add_overflow(start, length - 1, &last))
return -EOVERFLOW;
start_byte = start - ALIGN_DOWN(start, PAGE_SIZE);
dmabuf = dma_buf_get(fd);
if (!IS_ERR(dmabuf)) {
pages = iopt_alloc_dmabuf_pages(ictx, dmabuf, start_byte, start,
length,
iommu_prot & IOMMU_WRITE);
if (IS_ERR(pages)) {
dma_buf_put(dmabuf);
return PTR_ERR(pages);
}
} else {
struct file *file;
file = fget(fd);
if (!file)
return -EBADF;
pages = iopt_alloc_file_pages(file, start_byte, start, length,
iommu_prot & IOMMU_WRITE);
fput(file);
if (IS_ERR(pages))
return PTR_ERR(pages);
}
pages = iopt_alloc_file_pages(file, start, length,
iommu_prot & IOMMU_WRITE);
if (IS_ERR(pages))
return PTR_ERR(pages);
return iopt_map_common(ictx, iopt, pages, iova, length,
start - pages->start, iommu_prot, flags);
start_byte, iommu_prot, flags);
}
struct iova_bitmap_fn_arg {
@@ -961,9 +994,15 @@ static void iopt_unfill_domain(struct io_pagetable *iopt,
WARN_ON(!area->storage_domain);
if (area->storage_domain == domain)
area->storage_domain = storage_domain;
if (iopt_is_dmabuf(pages)) {
if (!iopt_dmabuf_revoked(pages))
iopt_area_unmap_domain(area, domain);
iopt_dmabuf_untrack_domain(pages, area, domain);
}
mutex_unlock(&pages->mutex);
iopt_area_unmap_domain(area, domain);
if (!iopt_is_dmabuf(pages))
iopt_area_unmap_domain(area, domain);
}
return;
}
@@ -980,6 +1019,8 @@ static void iopt_unfill_domain(struct io_pagetable *iopt,
WARN_ON(area->storage_domain != domain);
area->storage_domain = NULL;
iopt_area_unfill_domain(area, pages, domain);
if (iopt_is_dmabuf(pages))
iopt_dmabuf_untrack_domain(pages, area, domain);
mutex_unlock(&pages->mutex);
}
}
@@ -1009,10 +1050,16 @@ static int iopt_fill_domain(struct io_pagetable *iopt,
if (!pages)
continue;
mutex_lock(&pages->mutex);
guard(mutex)(&pages->mutex);
if (iopt_is_dmabuf(pages)) {
rc = iopt_dmabuf_track_domain(pages, area, domain);
if (rc)
goto out_unfill;
}
rc = iopt_area_fill_domain(area, domain);
if (rc) {
mutex_unlock(&pages->mutex);
if (iopt_is_dmabuf(pages))
iopt_dmabuf_untrack_domain(pages, area, domain);
goto out_unfill;
}
if (!area->storage_domain) {
@@ -1021,7 +1068,6 @@ static int iopt_fill_domain(struct io_pagetable *iopt,
interval_tree_insert(&area->pages_node,
&pages->domains_itree);
}
mutex_unlock(&pages->mutex);
}
return 0;
@@ -1042,6 +1088,8 @@ static int iopt_fill_domain(struct io_pagetable *iopt,
area->storage_domain = NULL;
}
iopt_area_unfill_domain(area, pages, domain);
if (iopt_is_dmabuf(pages))
iopt_dmabuf_untrack_domain(pages, area, domain);
mutex_unlock(&pages->mutex);
}
return rc;
@@ -1252,6 +1300,10 @@ static int iopt_area_split(struct iopt_area *area, unsigned long iova)
if (!pages || area->prevent_access)
return -EBUSY;
/* Maintaining the domains_itree below is a bit complicated */
if (iopt_is_dmabuf(pages))
return -EOPNOTSUPP;
if (new_start & (alignment - 1) ||
iopt_area_start_byte(area, new_start) & (alignment - 1))
return -EINVAL;


@@ -5,6 +5,7 @@
#ifndef __IO_PAGETABLE_H
#define __IO_PAGETABLE_H
#include <linux/dma-buf.h>
#include <linux/interval_tree.h>
#include <linux/kref.h>
#include <linux/mutex.h>
@@ -69,6 +70,16 @@ void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
void iopt_area_unmap_domain(struct iopt_area *area,
struct iommu_domain *domain);
int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area,
struct iommu_domain *domain);
void iopt_dmabuf_untrack_domain(struct iopt_pages *pages,
struct iopt_area *area,
struct iommu_domain *domain);
int iopt_dmabuf_track_all_domains(struct iopt_area *area,
struct iopt_pages *pages);
void iopt_dmabuf_untrack_all_domains(struct iopt_area *area,
struct iopt_pages *pages);
static inline unsigned long iopt_area_index(struct iopt_area *area)
{
return area->pages_node.start;
@@ -179,7 +190,22 @@ enum {
enum iopt_address_type {
IOPT_ADDRESS_USER = 0,
IOPT_ADDRESS_FILE = 1,
IOPT_ADDRESS_FILE,
IOPT_ADDRESS_DMABUF,
};
struct iopt_pages_dmabuf_track {
struct iommu_domain *domain;
struct iopt_area *area;
struct list_head elm;
};
struct iopt_pages_dmabuf {
struct dma_buf_attachment *attach;
struct dma_buf_phys_vec phys;
/* Always PAGE_SIZE aligned */
unsigned long start;
struct list_head tracker;
};
/*
@@ -209,6 +235,8 @@ struct iopt_pages {
struct file *file;
unsigned long start;
};
/* IOPT_ADDRESS_DMABUF */
struct iopt_pages_dmabuf dmabuf;
};
bool writable:1;
u8 account_mode;
@@ -220,10 +248,32 @@ struct iopt_pages {
struct rb_root_cached domains_itree;
};
static inline bool iopt_is_dmabuf(struct iopt_pages *pages)
{
if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER))
return false;
return pages->type == IOPT_ADDRESS_DMABUF;
}
static inline bool iopt_dmabuf_revoked(struct iopt_pages *pages)
{
lockdep_assert_held(&pages->mutex);
if (iopt_is_dmabuf(pages))
return pages->dmabuf.phys.len == 0;
return false;
}
struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
unsigned long length, bool writable);
struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
struct iopt_pages *iopt_alloc_file_pages(struct file *file,
unsigned long start_byte,
unsigned long start,
unsigned long length, bool writable);
struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx,
struct dma_buf *dmabuf,
unsigned long start_byte,
unsigned long start,
unsigned long length, bool writable);
void iopt_release_pages(struct kref *kref);
static inline void iopt_put_pages(struct iopt_pages *pages)
{


@@ -207,7 +207,6 @@ int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
unsigned long iova = cmd->iova;
struct iommufd_ioas *ioas;
unsigned int flags = 0;
struct file *file;
int rc;
if (cmd->flags &
@@ -229,11 +228,7 @@ int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
flags = IOPT_ALLOC_IOVA;
file = fget(cmd->fd);
if (!file)
return -EBADF;
rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, file,
rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, cmd->fd,
cmd->start, cmd->length,
conv_iommu_prot(cmd->flags), flags);
if (rc)
@@ -243,7 +238,6 @@ int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
out_put:
iommufd_put_object(ucmd->ictx, &ioas->obj);
fput(file);
return rc;
}


@@ -19,6 +19,8 @@ struct iommu_domain;
struct iommu_group;
struct iommu_option;
struct iommufd_device;
struct dma_buf_attachment;
struct dma_buf_phys_vec;
struct iommufd_sw_msi_map {
struct list_head sw_msi_item;
@@ -108,7 +110,7 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
unsigned long length, int iommu_prot,
unsigned int flags);
int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
unsigned long *iova, struct file *file,
unsigned long *iova, int fd,
unsigned long start, unsigned long length,
int iommu_prot, unsigned int flags);
int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
@@ -504,6 +506,8 @@ void iommufd_device_pre_destroy(struct iommufd_object *obj);
void iommufd_device_destroy(struct iommufd_object *obj);
int iommufd_get_hw_info(struct iommufd_ucmd *ucmd);
struct device *iommufd_global_device(void);
struct iommufd_access {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
@@ -711,6 +715,8 @@ bool iommufd_should_fail(void);
int __init iommufd_test_init(void);
void iommufd_test_exit(void);
bool iommufd_selftest_is_mock_dev(struct device *dev);
int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
struct dma_buf_phys_vec *phys);
#else
static inline void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
unsigned int ioas_id,
@@ -732,5 +738,11 @@ static inline bool iommufd_selftest_is_mock_dev(struct device *dev)
{
return false;
}
static inline int
iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
struct dma_buf_phys_vec *phys)
{
return -EOPNOTSUPP;
}
#endif
#endif


@@ -29,6 +29,8 @@ enum {
IOMMU_TEST_OP_PASID_REPLACE,
IOMMU_TEST_OP_PASID_DETACH,
IOMMU_TEST_OP_PASID_CHECK_HWPT,
IOMMU_TEST_OP_DMABUF_GET,
IOMMU_TEST_OP_DMABUF_REVOKE,
};
enum {
@@ -176,6 +178,14 @@ struct iommu_test_cmd {
__u32 hwpt_id;
/* @id is stdev_id */
} pasid_check;
struct {
__u32 length;
__u32 open_flags;
} dmabuf_get;
struct {
__s32 dmabuf_fd;
__u32 revoked;
} dmabuf_revoke;
};
__u32 last;
};


@@ -751,6 +751,15 @@ static struct miscdevice vfio_misc_dev = {
.mode = 0666,
};
/*
* Used only by DMABUF, returns a valid struct device to use as a dummy struct
* device for attachment.
*/
struct device *iommufd_global_device(void)
{
return iommu_misc_dev.this_device;
}
static int __init iommufd_init(void)
{
int ret;
@@ -794,5 +803,6 @@ MODULE_ALIAS("devname:vfio/vfio");
#endif
MODULE_IMPORT_NS("IOMMUFD_INTERNAL");
MODULE_IMPORT_NS("IOMMUFD");
MODULE_IMPORT_NS("DMA_BUF");
MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
MODULE_LICENSE("GPL");


@@ -45,6 +45,8 @@
* last_iova + 1 can overflow. An iopt_pages index will always be much less than
* ULONG_MAX so last_index + 1 cannot overflow.
*/
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/file.h>
#include <linux/highmem.h>
#include <linux/iommu.h>
@@ -53,6 +55,7 @@
#include <linux/overflow.h>
#include <linux/slab.h>
#include <linux/sched/mm.h>
#include <linux/vfio_pci_core.h>
#include "double_span.h"
#include "io_pagetable.h"
@@ -258,6 +261,11 @@ static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
return container_of(node, struct iopt_area, pages_node);
}
enum batch_kind {
BATCH_CPU_MEMORY = 0,
BATCH_MMIO,
};
/*
* A simple datastructure to hold a vector of PFNs, optimized for contiguous
* PFNs. This is used as a temporary holding memory for shuttling pfns from one
@@ -271,7 +279,9 @@ struct pfn_batch {
unsigned int array_size;
unsigned int end;
unsigned int total_pfns;
enum batch_kind kind;
};
enum { MAX_NPFNS = type_max(typeof(((struct pfn_batch *)0)->npfns[0])) };
static void batch_clear(struct pfn_batch *batch)
{
@@ -348,11 +358,17 @@ static void batch_destroy(struct pfn_batch *batch, void *backup)
}
static bool batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn,
u32 nr)
u32 nr, enum batch_kind kind)
{
const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns));
unsigned int end = batch->end;
if (batch->kind != kind) {
/* One kind per batch */
if (batch->end != 0)
return false;
batch->kind = kind;
}
if (end && pfn == batch->pfns[end - 1] + batch->npfns[end - 1] &&
nr <= MAX_NPFNS - batch->npfns[end - 1]) {
batch->npfns[end - 1] += nr;
@@ -379,7 +395,7 @@ static void batch_remove_pfn_num(struct pfn_batch *batch, unsigned long nr)
/* true if the pfn was added, false otherwise */
static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
{
return batch_add_pfn_num(batch, pfn, 1);
return batch_add_pfn_num(batch, pfn, 1, BATCH_CPU_MEMORY);
}
/*
@@ -492,6 +508,7 @@ static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
{
bool disable_large_pages = area->iopt->disable_large_pages;
unsigned long last_iova = iopt_area_last_iova(area);
int iommu_prot = area->iommu_prot;
unsigned int page_offset = 0;
unsigned long start_iova;
unsigned long next_iova;
@@ -499,6 +516,11 @@ static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
unsigned long iova;
int rc;
if (batch->kind == BATCH_MMIO) {
iommu_prot &= ~IOMMU_CACHE;
iommu_prot |= IOMMU_MMIO;
}
/* The first index might be a partial page */
if (start_index == iopt_area_index(area))
page_offset = area->page_offset;
@@ -512,11 +534,11 @@ static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
rc = batch_iommu_map_small(
domain, iova,
PFN_PHYS(batch->pfns[cur]) + page_offset,
next_iova - iova, area->iommu_prot);
next_iova - iova, iommu_prot);
else
rc = iommu_map(domain, iova,
PFN_PHYS(batch->pfns[cur]) + page_offset,
next_iova - iova, area->iommu_prot,
next_iova - iova, iommu_prot,
GFP_KERNEL_ACCOUNT);
if (rc)
goto err_unmap;
@@ -652,7 +674,7 @@ static int batch_from_folios(struct pfn_batch *batch, struct folio ***folios_p,
nr = min(nr, npages);
npages -= nr;
if (!batch_add_pfn_num(batch, pfn, nr))
if (!batch_add_pfn_num(batch, pfn, nr, BATCH_CPU_MEMORY))
break;
if (nr > 1) {
rc = folio_add_pins(folio, nr - 1);
@@ -1054,6 +1076,41 @@ static int pfn_reader_user_update_pinned(struct pfn_reader_user *user,
return iopt_pages_update_pinned(pages, npages, inc, user);
}
struct pfn_reader_dmabuf {
struct dma_buf_phys_vec phys;
unsigned long start_offset;
};
static int pfn_reader_dmabuf_init(struct pfn_reader_dmabuf *dmabuf,
struct iopt_pages *pages)
{
/* Callers must not get here if the dmabuf was already revoked */
if (WARN_ON(iopt_dmabuf_revoked(pages)))
return -EINVAL;
dmabuf->phys = pages->dmabuf.phys;
dmabuf->start_offset = pages->dmabuf.start;
return 0;
}
static int pfn_reader_fill_dmabuf(struct pfn_reader_dmabuf *dmabuf,
struct pfn_batch *batch,
unsigned long start_index,
unsigned long last_index)
{
unsigned long start = dmabuf->start_offset + start_index * PAGE_SIZE;
/*
* start/last_index and start are all PAGE_SIZE aligned, the batch is
* always filled using page size aligned PFNs just like the other types.
* If the dmabuf has been sliced on a sub page offset then the common
* batch to domain code will adjust it before mapping to the domain.
*/
batch_add_pfn_num(batch, PHYS_PFN(dmabuf->phys.paddr + start),
last_index - start_index + 1, BATCH_MMIO);
return 0;
}
/*
* PFNs are stored in three places, in order of preference:
* - The iopt_pages xarray. This is only populated if there is a
@@ -1072,7 +1129,10 @@ struct pfn_reader {
unsigned long batch_end_index;
unsigned long last_index;
struct pfn_reader_user user;
union {
struct pfn_reader_user user;
struct pfn_reader_dmabuf dmabuf;
};
};
static int pfn_reader_update_pinned(struct pfn_reader *pfns)
@@ -1108,7 +1168,7 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
{
struct interval_tree_double_span_iter *span = &pfns->span;
unsigned long start_index = pfns->batch_end_index;
struct pfn_reader_user *user = &pfns->user;
struct pfn_reader_user *user;
unsigned long npages;
struct iopt_area *area;
int rc;
@@ -1140,8 +1200,13 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
return 0;
}
if (start_index >= pfns->user.upages_end) {
rc = pfn_reader_user_pin(&pfns->user, pfns->pages, start_index,
if (iopt_is_dmabuf(pfns->pages))
return pfn_reader_fill_dmabuf(&pfns->dmabuf, &pfns->batch,
start_index, span->last_hole);
user = &pfns->user;
if (start_index >= user->upages_end) {
rc = pfn_reader_user_pin(user, pfns->pages, start_index,
span->last_hole);
if (rc)
return rc;
@@ -1209,7 +1274,10 @@ static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
pfns->batch_start_index = start_index;
pfns->batch_end_index = start_index;
pfns->last_index = last_index;
pfn_reader_user_init(&pfns->user, pages);
if (iopt_is_dmabuf(pages))
pfn_reader_dmabuf_init(&pfns->dmabuf, pages);
else
pfn_reader_user_init(&pfns->user, pages);
rc = batch_init(&pfns->batch, last_index - start_index + 1);
if (rc)
return rc;
@@ -1230,8 +1298,12 @@ static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
static void pfn_reader_release_pins(struct pfn_reader *pfns)
{
struct iopt_pages *pages = pfns->pages;
struct pfn_reader_user *user = &pfns->user;
struct pfn_reader_user *user;
if (iopt_is_dmabuf(pages))
return;
user = &pfns->user;
if (user->upages_end > pfns->batch_end_index) {
/* Any pages not transferred to the batch are just unpinned */
@@ -1261,7 +1333,8 @@ static void pfn_reader_destroy(struct pfn_reader *pfns)
struct iopt_pages *pages = pfns->pages;
pfn_reader_release_pins(pfns);
pfn_reader_user_destroy(&pfns->user, pfns->pages);
if (!iopt_is_dmabuf(pfns->pages))
pfn_reader_user_destroy(&pfns->user, pfns->pages);
batch_destroy(&pfns->batch, NULL);
WARN_ON(pages->last_npinned != pages->npinned);
}
@@ -1340,26 +1413,234 @@ struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
return pages;
}
struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
struct iopt_pages *iopt_alloc_file_pages(struct file *file,
unsigned long start_byte,
unsigned long start,
unsigned long length, bool writable)
{
struct iopt_pages *pages;
unsigned long start_down = ALIGN_DOWN(start, PAGE_SIZE);
unsigned long end;
if (length && check_add_overflow(start, length - 1, &end))
return ERR_PTR(-EOVERFLOW);
pages = iopt_alloc_pages(start - start_down, length, writable);
pages = iopt_alloc_pages(start_byte, length, writable);
if (IS_ERR(pages))
return pages;
pages->file = get_file(file);
pages->start = start_down;
pages->start = start - start_byte;
pages->type = IOPT_ADDRESS_FILE;
return pages;
}
static void iopt_revoke_notify(struct dma_buf_attachment *attach)
{
struct iopt_pages *pages = attach->importer_priv;
struct iopt_pages_dmabuf_track *track;
guard(mutex)(&pages->mutex);
if (iopt_dmabuf_revoked(pages))
return;
list_for_each_entry(track, &pages->dmabuf.tracker, elm) {
struct iopt_area *area = track->area;
iopt_area_unmap_domain_range(area, track->domain,
iopt_area_index(area),
iopt_area_last_index(area));
}
pages->dmabuf.phys.len = 0;
}
static struct dma_buf_attach_ops iopt_dmabuf_attach_revoke_ops = {
.allow_peer2peer = true,
.move_notify = iopt_revoke_notify,
};
/*
* iommufd and vfio have a circular dependency. Future work for a phys-based
* private interconnect will remove this.
*/
static int
sym_vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
struct dma_buf_phys_vec *phys)
{
typeof(&vfio_pci_dma_buf_iommufd_map) fn;
int rc;
rc = iommufd_test_dma_buf_iommufd_map(attachment, phys);
if (rc != -EOPNOTSUPP)
return rc;
if (!IS_ENABLED(CONFIG_VFIO_PCI_DMABUF))
return -EOPNOTSUPP;
fn = symbol_get(vfio_pci_dma_buf_iommufd_map);
if (!fn)
return -EOPNOTSUPP;
rc = fn(attachment, phys);
symbol_put(vfio_pci_dma_buf_iommufd_map);
return rc;
}
static int iopt_map_dmabuf(struct iommufd_ctx *ictx, struct iopt_pages *pages,
struct dma_buf *dmabuf)
{
struct dma_buf_attachment *attach;
int rc;
attach = dma_buf_dynamic_attach(dmabuf, iommufd_global_device(),
&iopt_dmabuf_attach_revoke_ops, pages);
if (IS_ERR(attach))
return PTR_ERR(attach);
dma_resv_lock(dmabuf->resv, NULL);
/*
* Lock ordering requires the mutex to be taken inside the reservation;
* make sure lockdep sees this.
*/
if (IS_ENABLED(CONFIG_LOCKDEP)) {
mutex_lock(&pages->mutex);
mutex_unlock(&pages->mutex);
}
rc = sym_vfio_pci_dma_buf_iommufd_map(attach, &pages->dmabuf.phys);
if (rc)
goto err_detach;
dma_resv_unlock(dmabuf->resv);
/* On success iopt_release_pages() will detach and put the dmabuf. */
pages->dmabuf.attach = attach;
return 0;
err_detach:
dma_resv_unlock(dmabuf->resv);
dma_buf_detach(dmabuf, attach);
return rc;
}
struct iopt_pages *iopt_alloc_dmabuf_pages(struct iommufd_ctx *ictx,
struct dma_buf *dmabuf,
unsigned long start_byte,
unsigned long start,
unsigned long length, bool writable)
{
static struct lock_class_key pages_dmabuf_mutex_key;
struct iopt_pages *pages;
int rc;
if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER))
return ERR_PTR(-EOPNOTSUPP);
if (dmabuf->size <= (start + length - 1) ||
length / PAGE_SIZE >= MAX_NPFNS)
return ERR_PTR(-EINVAL);
pages = iopt_alloc_pages(start_byte, length, writable);
if (IS_ERR(pages))
return pages;
/*
* The mmap_lock can be held when obtaining the dmabuf reservation lock,
* which creates a locking cycle with the pages mutex, which is held
* while obtaining the mmap_lock. This locking path is not present for
* IOPT_ADDRESS_DMABUF, so split the lock class.
*/
lockdep_set_class(&pages->mutex, &pages_dmabuf_mutex_key);
/* dmabuf does not use pinned page accounting. */
pages->account_mode = IOPT_PAGES_ACCOUNT_NONE;
pages->type = IOPT_ADDRESS_DMABUF;
pages->dmabuf.start = start - start_byte;
INIT_LIST_HEAD(&pages->dmabuf.tracker);
rc = iopt_map_dmabuf(ictx, pages, dmabuf);
if (rc) {
iopt_put_pages(pages);
return ERR_PTR(rc);
}
return pages;
}
int iopt_dmabuf_track_domain(struct iopt_pages *pages, struct iopt_area *area,
struct iommu_domain *domain)
{
struct iopt_pages_dmabuf_track *track;
lockdep_assert_held(&pages->mutex);
if (WARN_ON(!iopt_is_dmabuf(pages)))
return -EINVAL;
list_for_each_entry(track, &pages->dmabuf.tracker, elm)
if (WARN_ON(track->domain == domain && track->area == area))
return -EINVAL;
track = kzalloc(sizeof(*track), GFP_KERNEL);
if (!track)
return -ENOMEM;
track->domain = domain;
track->area = area;
list_add_tail(&track->elm, &pages->dmabuf.tracker);
return 0;
}
void iopt_dmabuf_untrack_domain(struct iopt_pages *pages,
struct iopt_area *area,
struct iommu_domain *domain)
{
struct iopt_pages_dmabuf_track *track;
lockdep_assert_held(&pages->mutex);
WARN_ON(!iopt_is_dmabuf(pages));
list_for_each_entry(track, &pages->dmabuf.tracker, elm) {
if (track->domain == domain && track->area == area) {
list_del(&track->elm);
kfree(track);
return;
}
}
WARN_ON(true);
}
int iopt_dmabuf_track_all_domains(struct iopt_area *area,
struct iopt_pages *pages)
{
struct iopt_pages_dmabuf_track *track;
struct iommu_domain *domain;
unsigned long index;
int rc;
list_for_each_entry(track, &pages->dmabuf.tracker, elm)
if (WARN_ON(track->area == area))
return -EINVAL;
xa_for_each(&area->iopt->domains, index, domain) {
rc = iopt_dmabuf_track_domain(pages, area, domain);
if (rc)
goto err_untrack;
}
return 0;
err_untrack:
iopt_dmabuf_untrack_all_domains(area, pages);
return rc;
}
void iopt_dmabuf_untrack_all_domains(struct iopt_area *area,
struct iopt_pages *pages)
{
struct iopt_pages_dmabuf_track *track;
struct iopt_pages_dmabuf_track *tmp;
list_for_each_entry_safe(track, tmp, &pages->dmabuf.tracker,
elm) {
if (track->area == area) {
list_del(&track->elm);
kfree(track);
}
}
}
void iopt_release_pages(struct kref *kref)
{
struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
@@ -1372,8 +1653,15 @@ void iopt_release_pages(struct kref *kref)
mutex_destroy(&pages->mutex);
put_task_struct(pages->source_task);
free_uid(pages->source_user);
if (pages->type == IOPT_ADDRESS_FILE)
if (iopt_is_dmabuf(pages) && pages->dmabuf.attach) {
struct dma_buf *dmabuf = pages->dmabuf.attach->dmabuf;
dma_buf_detach(dmabuf, pages->dmabuf.attach);
dma_buf_put(dmabuf);
WARN_ON(!list_empty(&pages->dmabuf.tracker));
} else if (pages->type == IOPT_ADDRESS_FILE) {
fput(pages->file);
}
kfree(pages);
}
@@ -1451,6 +1739,14 @@ static void __iopt_area_unfill_domain(struct iopt_area *area,
lockdep_assert_held(&pages->mutex);
if (iopt_is_dmabuf(pages)) {
if (WARN_ON(iopt_dmabuf_revoked(pages)))
return;
iopt_area_unmap_domain_range(area, domain, start_index,
last_index);
return;
}
/*
* For security we must not unpin something that is still DMA mapped,
* so this must unmap any IOVA before we go ahead and unpin the pages.
@@ -1526,6 +1822,9 @@ void iopt_area_unmap_domain(struct iopt_area *area, struct iommu_domain *domain)
void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
struct iommu_domain *domain)
{
if (iopt_dmabuf_revoked(pages))
return;
__iopt_area_unfill_domain(area, pages, domain,
iopt_area_last_index(area));
}
@@ -1546,6 +1845,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
lockdep_assert_held(&area->pages->mutex);
if (iopt_dmabuf_revoked(area->pages))
return 0;
rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
iopt_area_last_index(area));
if (rc)
@@ -1605,33 +1907,44 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
return 0;
mutex_lock(&pages->mutex);
rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
iopt_area_last_index(area));
if (rc)
goto out_unlock;
if (iopt_is_dmabuf(pages)) {
rc = iopt_dmabuf_track_all_domains(area, pages);
if (rc)
goto out_unlock;
}
while (!pfn_reader_done(&pfns)) {
done_first_end_index = pfns.batch_end_index;
done_all_end_index = pfns.batch_start_index;
xa_for_each(&area->iopt->domains, index, domain) {
rc = batch_to_domain(&pfns.batch, domain, area,
pfns.batch_start_index);
if (!iopt_dmabuf_revoked(pages)) {
rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
iopt_area_last_index(area));
if (rc)
goto out_untrack;
while (!pfn_reader_done(&pfns)) {
done_first_end_index = pfns.batch_end_index;
done_all_end_index = pfns.batch_start_index;
xa_for_each(&area->iopt->domains, index, domain) {
rc = batch_to_domain(&pfns.batch, domain, area,
pfns.batch_start_index);
if (rc)
goto out_unmap;
}
done_all_end_index = done_first_end_index;
rc = pfn_reader_next(&pfns);
if (rc)
goto out_unmap;
}
done_all_end_index = done_first_end_index;
rc = pfn_reader_next(&pfns);
rc = pfn_reader_update_pinned(&pfns);
if (rc)
goto out_unmap;
pfn_reader_destroy(&pfns);
}
rc = pfn_reader_update_pinned(&pfns);
if (rc)
goto out_unmap;
area->storage_domain = xa_load(&area->iopt->domains, 0);
interval_tree_insert(&area->pages_node, &pages->domains_itree);
goto out_destroy;
mutex_unlock(&pages->mutex);
return 0;
out_unmap:
pfn_reader_release_pins(&pfns);
@@ -1658,8 +1971,10 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
end_index);
}
}
out_destroy:
pfn_reader_destroy(&pfns);
out_untrack:
if (iopt_is_dmabuf(pages))
iopt_dmabuf_untrack_all_domains(area, pages);
out_unlock:
mutex_unlock(&pages->mutex);
return rc;
@@ -1685,16 +2000,22 @@ void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
if (!area->storage_domain)
goto out_unlock;
xa_for_each(&iopt->domains, index, domain)
if (domain != area->storage_domain)
xa_for_each(&iopt->domains, index, domain) {
if (domain == area->storage_domain)
continue;
if (!iopt_dmabuf_revoked(pages))
iopt_area_unmap_domain_range(
area, domain, iopt_area_index(area),
iopt_area_last_index(area));
}
if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
WARN_ON(RB_EMPTY_NODE(&area->pages_node.rb));
interval_tree_remove(&area->pages_node, &pages->domains_itree);
iopt_area_unfill_domain(area, pages, area->storage_domain);
if (iopt_is_dmabuf(pages))
iopt_dmabuf_untrack_all_domains(area, pages);
area->storage_domain = NULL;
out_unlock:
mutex_unlock(&pages->mutex);
@@ -2031,15 +2352,14 @@ int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
return -EPERM;
if (pages->type == IOPT_ADDRESS_FILE)
if (iopt_is_dmabuf(pages))
return -EINVAL;
if (pages->type != IOPT_ADDRESS_USER)
return iopt_pages_rw_slow(pages, start_index, last_index,
start_byte % PAGE_SIZE, data, length,
flags);
if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
WARN_ON(pages->type != IOPT_ADDRESS_USER))
return -EINVAL;
if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) {
if (start_index == last_index)
return iopt_pages_rw_page(pages, start_index,


@@ -5,6 +5,8 @@
*/
#include <linux/anon_inodes.h>
#include <linux/debugfs.h>
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/fault-inject.h>
#include <linux/file.h>
#include <linux/iommu.h>
@@ -2031,6 +2033,140 @@ void iommufd_selftest_destroy(struct iommufd_object *obj)
}
}
struct iommufd_test_dma_buf {
void *memory;
size_t length;
bool revoked;
};
static int iommufd_test_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
{
return 0;
}
static void iommufd_test_dma_buf_detach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
{
}
static struct sg_table *
iommufd_test_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
{
return ERR_PTR(-EOPNOTSUPP);
}
static void iommufd_test_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
{
}
static void iommufd_test_dma_buf_release(struct dma_buf *dmabuf)
{
struct iommufd_test_dma_buf *priv = dmabuf->priv;
kfree(priv->memory);
kfree(priv);
}
static const struct dma_buf_ops iommufd_test_dmabuf_ops = {
.attach = iommufd_test_dma_buf_attach,
.detach = iommufd_test_dma_buf_detach,
.map_dma_buf = iommufd_test_dma_buf_map,
.release = iommufd_test_dma_buf_release,
.unmap_dma_buf = iommufd_test_dma_buf_unmap,
};
int iommufd_test_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
struct dma_buf_phys_vec *phys)
{
struct iommufd_test_dma_buf *priv = attachment->dmabuf->priv;
dma_resv_assert_held(attachment->dmabuf->resv);
if (attachment->dmabuf->ops != &iommufd_test_dmabuf_ops)
return -EOPNOTSUPP;
if (priv->revoked)
return -ENODEV;
phys->paddr = virt_to_phys(priv->memory);
phys->len = priv->length;
return 0;
}
static int iommufd_test_dmabuf_get(struct iommufd_ucmd *ucmd,
unsigned int open_flags,
size_t len)
{
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
struct iommufd_test_dma_buf *priv;
struct dma_buf *dmabuf;
int rc;
len = ALIGN(len, PAGE_SIZE);
if (len == 0 || len > PAGE_SIZE * 512)
return -EINVAL;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
return -ENOMEM;
priv->length = len;
priv->memory = kzalloc(len, GFP_KERNEL);
if (!priv->memory) {
rc = -ENOMEM;
goto err_free;
}
exp_info.ops = &iommufd_test_dmabuf_ops;
exp_info.size = len;
exp_info.flags = open_flags;
exp_info.priv = priv;
dmabuf = dma_buf_export(&exp_info);
if (IS_ERR(dmabuf)) {
rc = PTR_ERR(dmabuf);
goto err_free;
}
return dma_buf_fd(dmabuf, open_flags);
err_free:
kfree(priv->memory);
kfree(priv);
return rc;
}
static int iommufd_test_dmabuf_revoke(struct iommufd_ucmd *ucmd, int fd,
bool revoked)
{
struct iommufd_test_dma_buf *priv;
struct dma_buf *dmabuf;
int rc = 0;
dmabuf = dma_buf_get(fd);
if (IS_ERR(dmabuf))
return PTR_ERR(dmabuf);
if (dmabuf->ops != &iommufd_test_dmabuf_ops) {
rc = -EOPNOTSUPP;
goto err_put;
}
priv = dmabuf->priv;
dma_resv_lock(dmabuf->resv, NULL);
priv->revoked = revoked;
dma_buf_move_notify(dmabuf);
dma_resv_unlock(dmabuf->resv);
err_put:
dma_buf_put(dmabuf);
return rc;
}
int iommufd_test(struct iommufd_ucmd *ucmd)
{
struct iommu_test_cmd *cmd = ucmd->cmd;
@@ -2109,6 +2245,13 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
return iommufd_test_pasid_detach(ucmd, cmd);
case IOMMU_TEST_OP_PASID_CHECK_HWPT:
return iommufd_test_pasid_check_hwpt(ucmd, cmd);
case IOMMU_TEST_OP_DMABUF_GET:
return iommufd_test_dmabuf_get(ucmd, cmd->dmabuf_get.open_flags,
cmd->dmabuf_get.length);
case IOMMU_TEST_OP_DMABUF_REVOKE:
return iommufd_test_dmabuf_revoke(ucmd,
cmd->dmabuf_revoke.dmabuf_fd,
cmd->dmabuf_revoke.revoked);
default:
return -EOPNOTSUPP;
}


@@ -25,12 +25,12 @@ struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
struct p2pdma_provider mem[PCI_STD_NUM_BARS];
};
struct pci_p2pdma_pagemap {
struct pci_dev *provider;
u64 bus_offset;
struct dev_pagemap pgmap;
struct p2pdma_provider *mem;
};
static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,8 +204,8 @@ static void p2pdma_page_free(struct page *page)
{
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
struct pci_p2pdma *p2pdma = rcu_dereference_protected(
to_pci_dev(pgmap->mem->owner)->p2pdma, 1);
struct percpu_ref *ref;
gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -228,56 +228,136 @@ static void pci_p2pdma_release(void *data)
/* Flush and disable pci_alloc_p2p_mem() */
pdev->p2pdma = NULL;
synchronize_rcu();
if (p2pdma->pool)
synchronize_rcu();
xa_destroy(&p2pdma->map_types);
if (!p2pdma->pool)
return;
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
xa_destroy(&p2pdma->map_types);
}
static int pci_p2pdma_setup(struct pci_dev *pdev)
/**
* pcim_p2pdma_init - Initialize peer-to-peer DMA providers
* @pdev: The PCI device to enable P2PDMA for
*
* This function initializes the peer-to-peer DMA infrastructure
* for a PCI device. It allocates and sets up the necessary data
* structures to support P2PDMA operations, including mapping type
* tracking.
*/
int pcim_p2pdma_init(struct pci_dev *pdev)
{
int error = -ENOMEM;
struct pci_p2pdma *p2p;
int i, ret;
p2p = rcu_dereference_protected(pdev->p2pdma, 1);
if (p2p)
return 0;
p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
if (!p2p)
return -ENOMEM;
xa_init(&p2p->map_types);
/*
* Iterate over all standard PCI BARs and record only those that
* correspond to MMIO regions. Skip non-memory resources (e.g. I/O
* port BARs) since they cannot be used for peer-to-peer (P2P)
* transactions.
*/
for (i = 0; i < PCI_STD_NUM_BARS; i++) {
if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
continue;
p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
if (!p2p->pool)
goto out;
p2p->mem[i].owner = &pdev->dev;
p2p->mem[i].bus_offset =
pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
}
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
if (error)
goto out_pool_destroy;
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
goto out_pool_destroy;
ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
if (ret)
goto out_p2p;
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
out_pool_destroy:
gen_pool_destroy(p2p->pool);
out:
out_p2p:
devm_kfree(&pdev->dev, p2p);
return error;
return ret;
}
EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
/**
* pcim_p2pdma_provider - Get peer-to-peer DMA provider
* @pdev: The PCI device to enable P2PDMA for
* @bar: BAR index to get provider
*
* This function gets the peer-to-peer DMA provider for a PCI device. The
* lifetime of the provider (and of course the MMIO) is bound to the lifetime
* of the driver. A driver calling this function must ensure that all
* references to the provider, and any DMA mappings created for any MMIO, are
* cleaned up before the driver remove() completes.
*
* Since P2P is almost always shared with a second driver, some mechanism to
* notify, invalidate and revoke the MMIO's DMA mappings must be in place
* before using this function. For example, a revoke can be built using DMABUF.
*/
struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
{
struct pci_p2pdma *p2p;
if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
return NULL;
p2p = rcu_dereference_protected(pdev->p2pdma, 1);
if (WARN_ON(!p2p))
/* Someone forgot to call pcim_p2pdma_init() first */
return NULL;
return &p2p->mem[bar];
}
EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
{
struct pci_p2pdma *p2pdma;
int ret;
p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
if (p2pdma->pool)
/* We already set up the pool, do nothing. */
return 0;
p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
if (!p2pdma->pool)
return -ENOMEM;
ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (ret)
goto out_pool_destroy;
return 0;
out_pool_destroy:
gen_pool_destroy(p2pdma->pool);
p2pdma->pool = NULL;
return ret;
}
static void pci_p2pdma_unmap_mappings(void *data)
{
struct pci_dev *pdev = data;
struct pci_p2pdma_pagemap *p2p_pgmap = data;
/*
* Removing the alloc attribute from sysfs will call
* unmap_mapping_range() on the inode, teardown any existing userspace
* mappings and prevent new ones from being created.
*/
sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj,
&p2pmem_alloc_attr.attr,
p2pmem_group.name);
}
@@ -295,6 +375,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset)
{
struct pci_p2pdma_pagemap *p2p_pgmap;
struct p2pdma_provider *mem;
struct dev_pagemap *pgmap;
struct pci_p2pdma *p2pdma;
void *addr;
@@ -312,11 +393,21 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
if (size + offset > pci_resource_len(pdev, bar))
return -EINVAL;
if (!pdev->p2pdma) {
error = pci_p2pdma_setup(pdev);
if (error)
return error;
}
error = pcim_p2pdma_init(pdev);
if (error)
return error;
error = pci_p2pdma_setup_pool(pdev);
if (error)
return error;
mem = pcim_p2pdma_provider(pdev, bar);
/*
* We checked the validity of the BAR prior to calling
* pcim_p2pdma_provider(), so it should never return NULL.
*/
if (WARN_ON(!mem))
return -EINVAL;
p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
if (!p2p_pgmap)
@@ -328,10 +419,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->ops = &p2pdma_pgmap_ops;
p2p_pgmap->provider = pdev;
p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
pci_resource_start(pdev, bar);
p2p_pgmap->mem = mem;
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -340,7 +428,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
}
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
pdev);
p2p_pgmap);
if (error)
goto pages_free;
@@ -972,16 +1060,26 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
struct device *dev)
/**
* pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers
* @provider: P2PDMA provider structure
* @dev: Target device for the transfer
*
* Determines how peer-to-peer DMA transfers should be mapped between
* the provider and the target device. The mapping type indicates whether
* the transfer can be done directly through PCI switches or must go
* through the host bridge.
*/
enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
struct device *dev)
{
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *pdev = to_pci_dev(provider->owner);
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
int dist;
if (!provider->p2pdma)
if (!pdev->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
if (!dev_is_pci(dev))
@@ -990,7 +1088,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
client = to_pci_dev(dev);
rcu_read_lock();
p2pdma = rcu_dereference(provider->p2pdma);
p2pdma = rcu_dereference(pdev->p2pdma);
if (p2pdma)
type = xa_to_value(xa_load(&p2pdma->map_types,
@@ -998,7 +1096,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
rcu_read_unlock();
if (type == PCI_P2PDMA_MAP_UNKNOWN)
return calc_map_type_and_dist(provider, client, &dist, true);
return calc_map_type_and_dist(pdev, client, &dist, true);
return type;
}
@@ -1006,9 +1104,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
struct device *dev, struct page *page)
{
state->pgmap = page_pgmap(page);
state->map = pci_p2pdma_map_type(state->pgmap, dev);
state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
if (state->mem == p2p_pgmap->mem)
return;
state->mem = p2p_pgmap->mem;
state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev);
}
/**


@@ -55,6 +55,9 @@ config VFIO_PCI_ZDEV_KVM
To enable s390x KVM vfio-pci extensions, say Y.
config VFIO_PCI_DMABUF
def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
source "drivers/vfio/pci/mlx5/Kconfig"
source "drivers/vfio/pci/hisilicon/Kconfig"


@@ -2,6 +2,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o


@@ -7,6 +7,7 @@
#include <linux/vfio_pci_core.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/pci-p2pdma.h>
/*
* The device memory usable to the workloads running in the VM is cached
@@ -683,6 +684,50 @@ nvgrace_gpu_write(struct vfio_device *core_vdev,
return vfio_pci_core_write(core_vdev, buf, count, ppos);
}
static int nvgrace_get_dmabuf_phys(struct vfio_pci_core_device *core_vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges)
{
struct nvgrace_gpu_pci_core_device *nvdev = container_of(
core_vdev, struct nvgrace_gpu_pci_core_device, core_device);
struct pci_dev *pdev = core_vdev->pdev;
struct mem_region *mem_region;
/*
* if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) {
* The P2P properties of the non-BAR memory are the same as those
* of the BAR memory, so just use the provider for index 0. Someday
* when CXL gets P2P support we could create CXLish providers
* for the non-BAR memory.
* } else if (region_index == USEMEM_REGION_INDEX) {
* This is actually cacheable memory and isn't treated as P2P in
* the chip. For now we have no way to push cacheable memory
* through everything and the Grace HW doesn't care what caching
* attribute is programmed into the SMMU. So use BAR 0.
* }
*/
mem_region = nvgrace_gpu_memregion(region_index, nvdev);
if (mem_region) {
*provider = pcim_p2pdma_provider(pdev, 0);
if (!*provider)
return -EINVAL;
return vfio_pci_core_fill_phys_vec(phys_vec, dma_ranges,
nr_ranges,
mem_region->memphys,
mem_region->memlength);
}
return vfio_pci_core_get_dmabuf_phys(core_vdev, provider, region_index,
phys_vec, dma_ranges, nr_ranges);
}
static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_ops = {
.get_dmabuf_phys = nvgrace_get_dmabuf_phys,
};
static const struct vfio_device_ops nvgrace_gpu_pci_ops = {
.name = "nvgrace-gpu-vfio-pci",
.init = vfio_pci_core_init_dev,
@@ -703,6 +748,10 @@ static const struct vfio_device_ops nvgrace_gpu_pci_ops = {
.detach_ioas = vfio_iommufd_physical_detach_ioas,
};
static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_core_ops = {
.get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys,
};
static const struct vfio_device_ops nvgrace_gpu_pci_core_ops = {
.name = "nvgrace-gpu-vfio-pci-core",
.init = vfio_pci_core_init_dev,
@@ -965,6 +1014,9 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
memphys, memlength);
if (ret)
goto out_put_vdev;
nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops;
} else {
nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops;
}
ret = vfio_pci_core_register_device(&nvdev->core_device);


@@ -147,6 +147,10 @@ static const struct vfio_device_ops vfio_pci_ops = {
.pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas,
};
static const struct vfio_pci_device_ops vfio_pci_dev_ops = {
.get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys,
};
static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct vfio_pci_core_device *vdev;
@@ -161,6 +165,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return PTR_ERR(vdev);
dev_set_drvdata(&pdev->dev, vdev);
vdev->pci_ops = &vfio_pci_dev_ops;
ret = vfio_pci_core_register_device(vdev);
if (ret)
goto out_put_vdev;


@@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
if (!new_mem)
if (!new_mem) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
else
vfio_pci_dma_buf_move(vdev, true);
} else {
down_write(&vdev->memory_lock);
}
/*
* If the user is writing mem/io enable (new_mem/io) and we
@@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
*virt_cmd &= cpu_to_le16(~mask);
*virt_cmd |= cpu_to_le16(new_cmd & mask);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
pci_power_t state)
{
if (state >= PCI_D3hot)
if (state >= PCI_D3hot) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
else
vfio_pci_dma_buf_move(vdev, true);
} else {
down_write(&vdev->memory_lock);
}
vfio_pci_set_power_state(vdev, state);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}
@@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}


@@ -28,6 +28,7 @@
#include <linux/nospec.h>
#include <linux/sched/mm.h>
#include <linux/iommufd.h>
#include <linux/pci-p2pdma.h>
#if IS_ENABLED(CONFIG_EEH)
#include <asm/eeh.h>
#endif
@@ -286,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
* semaphore.
*/
vfio_pci_zap_and_down_write_memory_lock(vdev);
vfio_pci_dma_buf_move(vdev, true);
if (vdev->pm_runtime_engaged) {
up_write(&vdev->memory_lock);
return -EINVAL;
@@ -299,11 +302,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
return 0;
}
-static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
-struct vfio_pci_core_device *vdev =
-container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -320,12 +321,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
}
static int vfio_pci_core_pm_entry_with_wakeup(
-struct vfio_device *device, u32 flags,
+struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_low_power_entry_with_wakeup __user *arg,
size_t argsz)
{
-struct vfio_pci_core_device *vdev =
-container_of(device, struct vfio_pci_core_device, vdev);
struct vfio_device_low_power_entry_with_wakeup entry;
struct eventfd_ctx *efdctx;
int ret;
@@ -373,14 +372,14 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
*/
down_write(&vdev->memory_lock);
__vfio_pci_runtime_pm_exit(vdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
-static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
-struct vfio_pci_core_device *vdev =
-container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -695,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
#endif
vfio_pci_core_disable(vdev);
vfio_pci_dma_buf_cleanup(vdev);
mutex_lock(&vdev->igate);
if (vdev->err_trigger) {
eventfd_ctx_put(vdev->err_trigger);
@@ -1227,7 +1228,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
*/
vfio_pci_set_power_state(vdev, PCI_D0);
vfio_pci_dma_buf_move(vdev, true);
ret = pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
return ret;
@@ -1473,11 +1477,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
-uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+u32 flags, uuid_t __user *arg,
+size_t argsz)
{
-struct vfio_pci_core_device *vdev =
-container_of(device, struct vfio_pci_core_device, vdev);
uuid_t uuid;
int ret;
@@ -1504,16 +1507,21 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
{
+struct vfio_pci_core_device *vdev =
+container_of(device, struct vfio_pci_core_device, vdev);
switch (flags & VFIO_DEVICE_FEATURE_MASK) {
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
-return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
-return vfio_pci_core_pm_entry_with_wakeup(device, flags,
+return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
-return vfio_pci_core_pm_exit(device, flags, arg, argsz);
+return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-return vfio_pci_core_feature_token(device, flags, arg, argsz);
+return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+case VFIO_DEVICE_FEATURE_DMA_BUF:
+return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
@@ -2085,6 +2093,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
{
struct vfio_pci_core_device *vdev =
container_of(core_vdev, struct vfio_pci_core_device, vdev);
int ret;
vdev->pdev = to_pci_dev(core_vdev->dev);
vdev->irq_type = VFIO_PCI_NUM_IRQS;
@@ -2094,6 +2103,10 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
INIT_LIST_HEAD(&vdev->dummy_resources_list);
INIT_LIST_HEAD(&vdev->ioeventfds_list);
INIT_LIST_HEAD(&vdev->sriov_pfs_item);
ret = pcim_p2pdma_init(vdev->pdev);
if (ret && ret != -EOPNOTSUPP)
return ret;
INIT_LIST_HEAD(&vdev->dmabufs);
init_rwsem(&vdev->memory_lock);
xa_init(&vdev->ctx);
@@ -2458,6 +2471,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
break;
}
vfio_pci_dma_buf_move(vdev, true);
vfio_pci_zap_bars(vdev);
}
@@ -2486,8 +2500,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
err_undo:
list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
-vdev.dev_set_list)
+vdev.dev_set_list) {
+if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev))
+vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
+}
list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
pm_runtime_put(&vdev->pdev->dev);
@@ -0,0 +1,350 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
*/
#include <linux/dma-buf-mapping.h>
#include <linux/pci-p2pdma.h>
#include <linux/dma-resv.h>
#include "vfio_pci_priv.h"
MODULE_IMPORT_NS("DMA_BUF");
struct vfio_pci_dma_buf {
struct dma_buf *dmabuf;
struct vfio_pci_core_device *vdev;
struct list_head dmabufs_elm;
size_t size;
struct dma_buf_phys_vec *phys_vec;
struct p2pdma_provider *provider;
u32 nr_ranges;
u8 revoked : 1;
};
static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
{
struct vfio_pci_dma_buf *priv = dmabuf->priv;
if (!attachment->peer2peer)
return -EOPNOTSUPP;
if (priv->revoked)
return -ENODEV;
return 0;
}
static struct sg_table *
vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
{
struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
dma_resv_assert_held(priv->dmabuf->resv);
if (priv->revoked)
return ERR_PTR(-ENODEV);
return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,
priv->size, dir);
}
static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
{
dma_buf_free_sgt(attachment, sgt, dir);
}
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
{
struct vfio_pci_dma_buf *priv = dmabuf->priv;
/*
* Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
* The refcount prevents both.
*/
if (priv->vdev) {
down_write(&priv->vdev->memory_lock);
list_del_init(&priv->dmabufs_elm);
up_write(&priv->vdev->memory_lock);
vfio_device_put_registration(&priv->vdev->vdev);
}
kfree(priv->phys_vec);
kfree(priv);
}
static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
.attach = vfio_pci_dma_buf_attach,
.map_dma_buf = vfio_pci_dma_buf_map,
.unmap_dma_buf = vfio_pci_dma_buf_unmap,
.release = vfio_pci_dma_buf_release,
};
/*
* This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
* It allows the two co-operating drivers to exchange the physical address of
* the BAR. This is to be replaced with a formal DMABUF system for negotiated
* interconnect types.
*
* If this function succeeds the following are true:
* - There is one physical range and it is pointing to MMIO
 * - When move_notify is called it means revoke, not move; vfio_pci_dma_buf_map()
 *   will fail if it is currently revoked
*/
int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
struct dma_buf_phys_vec *phys)
{
struct vfio_pci_dma_buf *priv;
dma_resv_assert_held(attachment->dmabuf->resv);
if (attachment->dmabuf->ops != &vfio_pci_dmabuf_ops)
return -EOPNOTSUPP;
priv = attachment->dmabuf->priv;
if (priv->revoked)
return -ENODEV;
/* More than one range to iommufd will require proper DMABUF support */
if (priv->nr_ranges != 1)
return -EOPNOTSUPP;
*phys = priv->phys_vec[0];
return 0;
}
EXPORT_SYMBOL_FOR_MODULES(vfio_pci_dma_buf_iommufd_map, "iommufd");
int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges, phys_addr_t start,
phys_addr_t len)
{
phys_addr_t max_addr;
unsigned int i;
max_addr = start + len;
for (i = 0; i < nr_ranges; i++) {
phys_addr_t end;
if (!dma_ranges[i].length)
return -EINVAL;
if (check_add_overflow(start, dma_ranges[i].offset,
&phys_vec[i].paddr) ||
check_add_overflow(phys_vec[i].paddr,
dma_ranges[i].length, &end))
return -EOVERFLOW;
if (end > max_addr)
return -EINVAL;
phys_vec[i].len = dma_ranges[i].length;
}
return 0;
}
EXPORT_SYMBOL_GPL(vfio_pci_core_fill_phys_vec);
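The bounds logic above can be exercised outside the kernel. The following userspace sketch mirrors vfio_pci_core_fill_phys_vec(), using __builtin_add_overflow() (which check_add_overflow() wraps) and local stand-in structs; all names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t offset; uint64_t length; };
struct pvec  { uint64_t paddr; size_t len; };

/* Mirror of the kernel logic: each range must be non-empty, must not
 * overflow when offset into the BAR, and must end at or before
 * start + bar_len. */
static int fill_phys_vec(struct pvec *out, const struct range *in,
			 size_t nr, uint64_t start, uint64_t bar_len)
{
	uint64_t max_addr = start + bar_len;

	for (size_t i = 0; i < nr; i++) {
		uint64_t end;

		if (!in[i].length)
			return -1;		/* -EINVAL */
		if (__builtin_add_overflow(start, in[i].offset, &out[i].paddr) ||
		    __builtin_add_overflow(out[i].paddr, in[i].length, &end))
			return -2;		/* -EOVERFLOW */
		if (end > max_addr)
			return -1;		/* -EINVAL */
		out[i].len = in[i].length;
	}
	return 0;
}
```

Each resulting (paddr, len) pair is one physically contiguous slice of the BAR; multiple ranges let userspace stitch a dmabuf out of disjoint slices.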
int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges)
{
struct pci_dev *pdev = vdev->pdev;
*provider = pcim_p2pdma_provider(pdev, region_index);
if (!*provider)
return -EINVAL;
return vfio_pci_core_fill_phys_vec(
phys_vec, dma_ranges, nr_ranges,
pci_resource_start(pdev, region_index),
pci_resource_len(pdev, region_index));
}
EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys);
static int validate_dmabuf_input(struct vfio_device_feature_dma_buf *dma_buf,
struct vfio_region_dma_range *dma_ranges,
size_t *lengthp)
{
size_t length = 0;
u32 i;
for (i = 0; i < dma_buf->nr_ranges; i++) {
u64 offset = dma_ranges[i].offset;
u64 len = dma_ranges[i].length;
if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
return -EINVAL;
if (check_add_overflow(length, len, &length))
return -EINVAL;
}
/*
 * dma_iova_try_alloc() will WARN if userspace proposes a size that
 * is too big, e.g. with lots of ranges.
*/
if ((u64)(length) & DMA_IOVA_USE_SWIOTLB)
return -EINVAL;
*lengthp = length;
return 0;
}
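The same validation can be modeled in userspace. This sketch (local names, 4 KiB pages assumed) mirrors validate_dmabuf_input(): every range must be non-empty and page aligned, and the summed length must not wrap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ull

struct drange { uint64_t offset; uint64_t length; };

static bool page_aligned(uint64_t v)
{
	return (v & (TOY_PAGE_SIZE - 1)) == 0;
}

/* Mirrors validate_dmabuf_input(): reject empty or unaligned ranges and
 * a total length that overflows; return the summed length via *total. */
static int validate_ranges(const struct drange *r, uint32_t nr,
			   uint64_t *total)
{
	uint64_t length = 0;

	for (uint32_t i = 0; i < nr; i++) {
		if (!r[i].length || !page_aligned(r[i].offset) ||
		    !page_aligned(r[i].length))
			return -1;	/* -EINVAL */
		if (__builtin_add_overflow(length, r[i].length, &length))
			return -1;
	}
	*total = length;
	return 0;
}
```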
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
{
struct vfio_device_feature_dma_buf get_dma_buf = {};
struct vfio_region_dma_range *dma_ranges;
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
struct vfio_pci_dma_buf *priv;
size_t length;
int ret;
if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
return -EOPNOTSUPP;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
sizeof(get_dma_buf));
if (ret != 1)
return ret;
if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
return -EFAULT;
if (!get_dma_buf.nr_ranges || get_dma_buf.flags)
return -EINVAL;
/*
* For PCI the region_index is the BAR number like everything else.
*/
if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
return -ENODEV;
dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges,
sizeof(*dma_ranges));
if (IS_ERR(dma_ranges))
return PTR_ERR(dma_ranges);
ret = validate_dmabuf_input(&get_dma_buf, dma_ranges, &length);
if (ret)
goto err_free_ranges;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
ret = -ENOMEM;
goto err_free_ranges;
}
priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec),
GFP_KERNEL);
if (!priv->phys_vec) {
ret = -ENOMEM;
goto err_free_priv;
}
priv->vdev = vdev;
priv->nr_ranges = get_dma_buf.nr_ranges;
priv->size = length;
ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider,
get_dma_buf.region_index,
priv->phys_vec, dma_ranges,
priv->nr_ranges);
if (ret)
goto err_free_phys;
kfree(dma_ranges);
dma_ranges = NULL;
if (!vfio_device_try_get_registration(&vdev->vdev)) {
ret = -ENODEV;
goto err_free_phys;
}
exp_info.ops = &vfio_pci_dmabuf_ops;
exp_info.size = priv->size;
exp_info.flags = get_dma_buf.open_flags;
exp_info.priv = priv;
priv->dmabuf = dma_buf_export(&exp_info);
if (IS_ERR(priv->dmabuf)) {
ret = PTR_ERR(priv->dmabuf);
goto err_dev_put;
}
/* dma_buf_put() now frees priv */
INIT_LIST_HEAD(&priv->dmabufs_elm);
down_write(&vdev->memory_lock);
dma_resv_lock(priv->dmabuf->resv, NULL);
priv->revoked = !__vfio_pci_memory_enabled(vdev);
list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
dma_resv_unlock(priv->dmabuf->resv);
up_write(&vdev->memory_lock);
/*
* dma_buf_fd() consumes the reference, when the file closes the dmabuf
* will be released.
*/
ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
if (ret < 0)
goto err_dma_buf;
return ret;
err_dma_buf:
dma_buf_put(priv->dmabuf);
err_dev_put:
vfio_device_put_registration(&vdev->vdev);
err_free_phys:
kfree(priv->phys_vec);
err_free_priv:
kfree(priv);
err_free_ranges:
kfree(dma_ranges);
return ret;
}
void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
{
struct vfio_pci_dma_buf *priv;
struct vfio_pci_dma_buf *tmp;
lockdep_assert_held_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!get_file_active(&priv->dmabuf->file))
continue;
if (priv->revoked != revoked) {
dma_resv_lock(priv->dmabuf->resv, NULL);
priv->revoked = revoked;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
}
fput(priv->dmabuf->file);
}
}
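The revoke protocol implemented by vfio_pci_dma_buf_move() and checked in vfio_pci_dma_buf_map() can be sketched with a toy buffer object (locking, refcounting, and dma-buf plumbing omitted; all names here are illustrative only):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_dmabuf {
	uint64_t paddr;
	bool revoked;		/* mirrors priv->revoked */
};

/* Like vfio_pci_dma_buf_map(): mapping fails while revoked. */
static int toy_map(const struct toy_dmabuf *b, uint64_t *out)
{
	if (b->revoked)
		return -ENODEV;
	*out = b->paddr;
	return 0;
}

/* Like vfio_pci_dma_buf_move(): flip the revoked state; the real driver
 * also fires dma_buf_move_notify() so importers tear down mappings. */
static void toy_move(struct toy_dmabuf *b, bool revoked)
{
	b->revoked = revoked;
}
```

This is the "no automatic remap" safety protocol from the cover letter: after a revoke, the importer stays unmapped until userspace explicitly maps again.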
void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
{
struct vfio_pci_dma_buf *priv;
struct vfio_pci_dma_buf *tmp;
down_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!get_file_active(&priv->dmabuf->file))
continue;
dma_resv_lock(priv->dmabuf->resv, NULL);
list_del_init(&priv->dmabufs_elm);
priv->vdev = NULL;
priv->revoked = true;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
vfio_device_put_registration(&vdev->vdev);
fput(priv->dmabuf->file);
}
up_write(&vdev->memory_lock);
}
@@ -107,4 +107,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}
#ifdef CONFIG_VFIO_PCI_DMABUF
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz);
void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
#else
static inline int
vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
{
return -ENOTTY;
}
static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
{
}
static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
bool revoked)
{
}
#endif
#endif
@@ -172,11 +172,13 @@ void vfio_device_put_registration(struct vfio_device *device)
if (refcount_dec_and_test(&device->refcount))
complete(&device->comp);
}
EXPORT_SYMBOL_GPL(vfio_device_put_registration);
bool vfio_device_try_get_registration(struct vfio_device *device)
{
return refcount_inc_not_zero(&device->refcount);
}
EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
/*
* VFIO driver API
@@ -0,0 +1,17 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
* DMA BUF Mapping Helpers
*
*/
#ifndef __DMA_BUF_MAPPING_H__
#define __DMA_BUF_MAPPING_H__
#include <linux/dma-buf.h>
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
struct p2pdma_provider *provider,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size,
enum dma_data_direction dir);
void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
enum dma_data_direction dir);
#endif
@@ -22,6 +22,7 @@
#include <linux/fs.h>
#include <linux/dma-fence.h>
#include <linux/wait.h>
#include <linux/pci-p2pdma.h>
struct device;
struct dma_buf;
@@ -530,6 +531,16 @@ struct dma_buf_export_info {
void *priv;
};
/**
 * struct dma_buf_phys_vec - describe a contiguous chunk of memory
 * @paddr: physical address of the chunk
 * @len: length of the chunk
*/
struct dma_buf_phys_vec {
phys_addr_t paddr;
size_t len;
};
/**
* DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
* @name: export-info name
@@ -16,7 +16,58 @@
struct block_device;
struct scatterlist;
/**
* struct p2pdma_provider
*
* A p2pdma provider is a range of MMIO address space available to the CPU.
*/
struct p2pdma_provider {
struct device *owner;
u64 bus_offset;
};
enum pci_p2pdma_map_type {
/*
* PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
* the mapping type has been calculated. Exported routines for the API
* will never return this value.
*/
PCI_P2PDMA_MAP_UNKNOWN = 0,
/*
* Not a PCI P2PDMA transfer.
*/
PCI_P2PDMA_MAP_NONE,
/*
* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
* traverse the host bridge and the host bridge is not in the
* allowlist. DMA Mapping routines should return an error when
* this is returned.
*/
PCI_P2PDMA_MAP_NOT_SUPPORTED,
/*
* PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
* each other directly through a PCI switch and the transaction will
* not traverse the host bridge. Such a mapping should program
* the DMA engine with PCI bus addresses.
*/
PCI_P2PDMA_MAP_BUS_ADDR,
/*
* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
* to each other, but the transaction traverses a host bridge on the
* allowlist. In this case, a normal mapping either with CPU physical
* addresses (in the case of dma-direct) or IOVA addresses (in the
* case of IOMMUs) should be used to program the DMA engine.
*/
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
};
#ifdef CONFIG_PCI_P2PDMA
int pcim_p2pdma_init(struct pci_dev *pdev);
struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
@@ -33,7 +84,18 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
struct device *dev);
#else /* CONFIG_PCI_P2PDMA */
static inline int pcim_p2pdma_init(struct pci_dev *pdev)
{
return -EOPNOTSUPP;
}
static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
int bar)
{
return NULL;
}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
@@ -85,6 +147,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
{
return sprintf(page, "none\n");
}
static inline enum pci_p2pdma_map_type
pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
{
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
}
#endif /* CONFIG_PCI_P2PDMA */
@@ -99,51 +166,12 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
return pci_p2pmem_find_many(&client, 1);
}
-enum pci_p2pdma_map_type {
-/*
- * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
- * the mapping type has been calculated. Exported routines for the API
- * will never return this value.
- */
-PCI_P2PDMA_MAP_UNKNOWN = 0,
-/*
- * Not a PCI P2PDMA transfer.
- */
-PCI_P2PDMA_MAP_NONE,
-/*
- * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
- * traverse the host bridge and the host bridge is not in the
- * allowlist. DMA Mapping routines should return an error when
- * this is returned.
- */
-PCI_P2PDMA_MAP_NOT_SUPPORTED,
-/*
- * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
- * each other directly through a PCI switch and the transaction will
- * not traverse the host bridge. Such a mapping should program
- * the DMA engine with PCI bus addresses.
- */
-PCI_P2PDMA_MAP_BUS_ADDR,
-/*
- * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
- * to each other, but the transaction traverses a host bridge on the
- * allowlist. In this case, a normal mapping either with CPU physical
- * addresses (in the case of dma-direct) or IOVA addresses (in the
- * case of IOMMUs) should be used to program the DMA engine.
- */
-PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
struct pci_p2pdma_map_state {
struct dev_pagemap *pgmap;
struct p2pdma_provider *mem;
enum pci_p2pdma_map_type map;
u64 bus_off;
};
/* helper for pci_p2pdma_state(), do not use directly */
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
struct device *dev, struct page *page);
@@ -162,8 +190,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
struct page *page)
{
if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-if (state->pgmap != page_pgmap(page))
-__pci_p2pdma_update_state(state, dev, page);
+__pci_p2pdma_update_state(state, dev, page);
return state->map;
}
return PCI_P2PDMA_MAP_NONE;
@@ -172,16 +199,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
/**
* pci_p2pdma_bus_addr_map - Translate a physical address to a bus address
* for a PCI_P2PDMA_MAP_BUS_ADDR transfer.
- * @state: P2P state structure
+ * @provider: P2P provider structure
 * @paddr: physical address to map
 *
 * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
 */
static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr)
{
-WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-return paddr + state->bus_off;
+return paddr + provider->bus_offset;
}
#endif /* _LINUX_PCI_P2P_H */
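After this change the bus-address translation is a plain offset add keyed off the provider, with no map-state dependency. A userspace model (with a local stand-in for struct p2pdma_provider) shows the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's struct p2pdma_provider. */
struct provider { uint64_t bus_offset; };

/* Mirrors pci_p2pdma_bus_addr_map(): translate a CPU physical address
 * into a PCI bus address for a PCI_P2PDMA_MAP_BUS_ADDR transfer. */
static uint64_t bus_addr_map(const struct provider *p, uint64_t paddr)
{
	return paddr + p->bus_offset;
}
```

Dropping the WARN_ON_ONCE() is safe because callers only reach this helper on the PCI_P2PDMA_MAP_BUS_ADDR path, as the dma-direct and hmm call sites below show.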
@@ -297,6 +297,8 @@ static inline void vfio_put_device(struct vfio_device *device)
int vfio_register_group_dev(struct vfio_device *device);
int vfio_register_emulated_iommu_dev(struct vfio_device *device);
void vfio_unregister_group_dev(struct vfio_device *device);
bool vfio_device_try_get_registration(struct vfio_device *device);
void vfio_device_put_registration(struct vfio_device *device);
int vfio_assign_device_set(struct vfio_device *device, void *set_id);
unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
@@ -26,6 +26,9 @@
struct vfio_pci_core_device;
struct vfio_pci_region;
struct p2pdma_provider;
struct dma_buf_phys_vec;
struct dma_buf_attachment;
struct vfio_pci_regops {
ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf,
@@ -49,9 +52,48 @@ struct vfio_pci_region {
u32 flags;
};
struct vfio_pci_device_ops {
int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges);
};
#if IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)
int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges, phys_addr_t start,
phys_addr_t len);
int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges);
#else
static inline int
vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges, phys_addr_t start,
phys_addr_t len)
{
return -EINVAL;
}
static inline int vfio_pci_core_get_dmabuf_phys(
struct vfio_pci_core_device *vdev, struct p2pdma_provider **provider,
unsigned int region_index, struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges, size_t nr_ranges)
{
return -EOPNOTSUPP;
}
#endif
struct vfio_pci_core_device {
struct vfio_device vdev;
struct pci_dev *pdev;
const struct vfio_pci_device_ops *pci_ops;
void __iomem *barmap[PCI_STD_NUM_BARS];
bool bar_mmap_supported[PCI_STD_NUM_BARS];
u8 *pci_config_map;
@@ -94,6 +136,7 @@ struct vfio_pci_core_device {
struct vfio_pci_core_device *sriov_pf_core_dev;
struct notifier_block nb;
struct rw_semaphore memory_lock;
struct list_head dmabufs;
};
/* Will be exported for vfio pci drivers usage */
@@ -161,4 +204,7 @@ VFIO_IOREAD_DECLARATION(32)
VFIO_IOREAD_DECLARATION(64)
#endif
int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
struct dma_buf_phys_vec *phys);
#endif /* VFIO_PCI_CORE_H */
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/ioctl.h>
#include <linux/stddef.h>
#define VFIO_API_VERSION 0
@@ -1478,6 +1479,33 @@ struct vfio_device_feature_bus_master {
};
#define VFIO_DEVICE_FEATURE_BUS_MASTER 10
/**
 * Upon VFIO_DEVICE_FEATURE_GET, create a dma_buf fd for the
 * region selected.
 *
 * open_flags are the typical flags passed to open(2), e.g. O_RDWR, O_CLOEXEC,
 * etc. offset/length specify a slice of the region to create the dmabuf from.
 * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
 *
 * flags should be 0.
 *
 * Return: The fd number on success, -1 with errno set on failure.
*/
#define VFIO_DEVICE_FEATURE_DMA_BUF 11
struct vfio_region_dma_range {
__u64 offset;
__u64 length;
};
struct vfio_device_feature_dma_buf {
__u32 region_index;
__u32 open_flags;
__u32 flags;
__u32 nr_ranges;
struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
};
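Because dma_ranges is a flexible array, userspace must size the payload itself before issuing the feature ioctl. A sketch of the size arithmetic, using local copies of the uAPI structs above (fixed-width stand-ins for __u32/__u64):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the uAPI layout above. */
struct vfio_region_dma_range {
	uint64_t offset;
	uint64_t length;
};

struct vfio_device_feature_dma_buf {
	uint32_t region_index;
	uint32_t open_flags;
	uint32_t flags;
	uint32_t nr_ranges;
	struct vfio_region_dma_range dma_ranges[];
};

/* Bytes needed for a request covering nr_ranges slices of the BAR. */
static size_t dma_buf_argsz(uint32_t nr_ranges)
{
	return sizeof(struct vfio_device_feature_dma_buf) +
	       nr_ranges * sizeof(struct vfio_region_dma_range);
}
```

In a real caller this buffer would sit after the vfio_device_feature header with VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF set; that wrapping is omitted here.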
/* -------- API for Type1 VFIO IOMMU -------- */
/**
@@ -479,8 +479,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
}
break;
case PCI_P2PDMA_MAP_BUS_ADDR:
-sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-sg_phys(sg));
+sg->dma_address = pci_p2pdma_bus_addr_map(
+p2pdma_state.mem, sg_phys(sg));
sg_dma_mark_bus_address(sg);
continue;
default:
@@ -811,7 +811,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
break;
case PCI_P2PDMA_MAP_BUS_ADDR:
pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
-return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
default:
return DMA_MAPPING_ERROR;
}
@@ -1574,6 +1574,49 @@ TEST_F(iommufd_ioas, copy_sweep)
test_ioctl_destroy(dst_ioas_id);
}
TEST_F(iommufd_ioas, dmabuf_simple)
{
size_t buf_size = PAGE_SIZE * 4;
__u64 iova;
int dfd;
test_cmd_get_dmabuf(buf_size, &dfd);
test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, 0, &iova);
test_err_ioctl_ioas_map_file(EINVAL, dfd, buf_size, buf_size, &iova);
test_err_ioctl_ioas_map_file(EINVAL, dfd, 0, buf_size + 1, &iova);
test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
close(dfd);
}
TEST_F(iommufd_ioas, dmabuf_revoke)
{
size_t buf_size = PAGE_SIZE * 4;
__u32 hwpt_id;
__u64 iova;
__u64 iova2;
int dfd;
test_cmd_get_dmabuf(buf_size, &dfd);
test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova);
test_cmd_revoke_dmabuf(dfd, true);
if (variant->mock_domains)
test_cmd_hwpt_alloc(self->device_id, self->ioas_id, 0,
&hwpt_id);
test_err_ioctl_ioas_map_file(ENODEV, dfd, 0, buf_size, &iova2);
test_cmd_revoke_dmabuf(dfd, false);
test_ioctl_ioas_map_file(dfd, 0, buf_size, &iova2);
/* Restore the iova back */
test_ioctl_ioas_unmap(iova, buf_size);
test_ioctl_ioas_map_fixed_file(dfd, 0, buf_size, iova);
close(dfd);
}
FIXTURE(iommufd_mock_domain)
{
int fd;
@@ -548,6 +548,39 @@ static int _test_cmd_destroy_access_pages(int fd, unsigned int access_id,
EXPECT_ERRNO(_errno, _test_cmd_destroy_access_pages( \
self->fd, access_id, access_pages_id))
static int _test_cmd_get_dmabuf(int fd, size_t len, int *out_fd)
{
struct iommu_test_cmd cmd = {
.size = sizeof(cmd),
.op = IOMMU_TEST_OP_DMABUF_GET,
.dmabuf_get = { .length = len, .open_flags = O_CLOEXEC },
};
*out_fd = ioctl(fd, IOMMU_TEST_CMD, &cmd);
if (*out_fd < 0)
return -1;
return 0;
}
#define test_cmd_get_dmabuf(len, out_fd) \
ASSERT_EQ(0, _test_cmd_get_dmabuf(self->fd, len, out_fd))
static int _test_cmd_revoke_dmabuf(int fd, int dmabuf_fd, bool revoked)
{
struct iommu_test_cmd cmd = {
.size = sizeof(cmd),
.op = IOMMU_TEST_OP_DMABUF_REVOKE,
.dmabuf_revoke = { .dmabuf_fd = dmabuf_fd, .revoked = revoked },
};
int ret;
ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
if (ret < 0)
return -1;
return 0;
}
#define test_cmd_revoke_dmabuf(dmabuf_fd, revoke) \
ASSERT_EQ(0, _test_cmd_revoke_dmabuf(self->fd, dmabuf_fd, revoke))
static int _test_ioctl_destroy(int fd, unsigned int id)
{
struct iommu_destroy cmd = {
@@ -718,6 +751,17 @@ static int _test_ioctl_ioas_map_file(int fd, unsigned int ioas_id, int mfd,
self->fd, ioas_id, mfd, start, length, iova_p, \
IOMMU_IOAS_MAP_WRITEABLE | IOMMU_IOAS_MAP_READABLE))
#define test_ioctl_ioas_map_fixed_file(mfd, start, length, iova) \
({ \
__u64 __iova = iova; \
ASSERT_EQ(0, _test_ioctl_ioas_map_file( \
self->fd, self->ioas_id, mfd, start, \
length, &__iova, \
IOMMU_IOAS_MAP_FIXED_IOVA | \
IOMMU_IOAS_MAP_WRITEABLE | \
IOMMU_IOAS_MAP_READABLE)); \
})
static int _test_ioctl_set_temp_memory_limit(int fd, unsigned int limit)
{
struct iommu_test_cmd memlimit_cmd = {