PCIe r7.0, sec 7.8.6, defines resizable BAR sizes beyond the currently
supported maximum of 128TB, which will require more than u32 to store the
entire bitmask.
Convert Resizable BAR related functions to use u64 bitmask for BAR sizes to
make the typing more future-proof.
The support for the larger BAR sizes themselves is not added at this point.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patch.msgid.link/20251113180053.27944-12-ilpo.jarvinen@linux.intel.com
pci_rebar_set_size() adjusts BAR size for both normal and IOV BARs. The
struct pci_sriov keeps a cached copy of BAR size in ->barsz[] which is not
adjusted by pci_rebar_set_size() but by pci_iov_resource_set_size().
pci_iov_resource_set_size() is called also from
pci_resize_resource_set_size().
The current arrangement is problematic once BAR resize algorithm starts to
roll back changes properly in case of a failure. The normal resource
fitting algorithm rolls back resource size using the struct
pci_dev_resource easily but also calling pci_resize_resource_set_size() or
pci_iov_resource_set_size() to roll back BAR size would be an extra burden,
whereas combining ->barsz[] update with pci_rebar_set_size() naturally
rolls back it when restoring the old BAR size on a different layer of the
BAR resize operation.
Thus, rework pci_rebar_set_size() to also update ->barsz[].
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org> # AVA, AMD GPU
Link: https://patch.msgid.link/20251113162628.5946-3-ilpo.jarvinen@linux.intel.com
Before disabling SR-IOV via config space accesses to the parent PF,
sriov_disable() first removes the PCI devices representing the VFs.
Since commit 9d16947b75 ("PCI: Add global pci_lock_rescan_remove()")
such removal operations are serialized against concurrent remove and
rescan using the pci_rescan_remove_lock. No such locking was ever added
in sriov_disable() however. In particular when commit 18f9e9d150
("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device
removal into sriov_del_vfs() there was still no locking around the
pci_iov_remove_virtfn() calls.
On s390 the lack of serialization in sriov_disable() may cause double
remove and list corruption with the below (amended) trace being observed:
PSW: 0704c00180000000 0000000c914e4b38 (klist_put+56)
GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001
00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480
0000000000000001 0000000000000000 0000000000000000 0000000180692828
00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8
#0 [3800313fb20] device_del at c9158ad5c
#1 [3800313fb88] pci_remove_bus_device at c915105ba
#2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198
#3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0
#4 [3800313fc60] zpci_bus_remove_device at c90fb6104
#5 [3800313fca0] __zpci_event_availability at c90fb3dca
#6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2
#7 [3800313fd60] crw_collect_info at c91905822
#8 [3800313fe10] kthread at c90feb390
#9 [3800313fe68] __ret_from_fork at c90f6aa64
#10 [3800313fe98] ret_from_fork at c9194f3f2.
This is because in addition to sriov_disable() removing the VFs, the
platform also generates hot-unplug events for the VFs. This being the
reverse operation to the hotplug events generated by sriov_enable() and
handled via pdev->no_vf_scan. And while the event processing takes
pci_rescan_remove_lock and checks whether the struct pci_dev still exists,
the lack of synchronization makes this checking racy.
Other races may also be possible of course though given that this lack of
locking persisted so long observable races seem very rare. Even on s390 the
list corruption was only observed with certain devices since the platform
events are only triggered by config accesses after the removal, so as long
as the removal finished synchronously they would not race. Either way the
locking is missing so fix this by adding it to the sriov_del_vfs() helper.
Just like PCI rescan-remove, locking is also missing in sriov_add_vfs()
including for the error case where pci_stop_and_remove_bus_device() is
called without the PCI rescan-remove lock being held. Even in the non-error
case, adding new PCI devices and buses should be serialized via the PCI
rescan-remove lock. Add the necessary locking.
Fixes: 18f9e9d150 ("PCI/IOV: Factor out sriov_add_vfs()")
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250826-pci_fix_sriov_disable-v1-1-2d0bc938f2a3@linux.ibm.com
Drivers could leverage the fact that the VF BAR MMIO reservation is created
for total number of VFs supported by the device by resizing the BAR to
larger size when smaller number of VFs is enabled.
Add pci_iov_vf_bar_set_size() to control the size and a
pci_iov_vf_bar_get_sizes() helper to get the VF BAR sizes that will allow
up to num_vfs to be successfully enabled with the current underlying
reservation size.
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20250702093522.518099-6-michal.winiarski@intel.com
When the resource representing a VF MMIO BAR reservation is created, its
size is always large enough to accommodate the BAR of all SR-IOV Virtual
Functions that can potentially be created (total VFs). If for whatever
reason it's not possible to accommodate all VFs, the resource is not
assigned and no VFs can be created.
An upcoming change will allow VF BAR size to be modified by drivers at a
later point in time, which means that the check for resource assignment is
no longer sufficient.
Add an additional check that verifies that the VF BAR for all enabled VFs
fits within the underlying reservation resource.
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20250702093522.518099-5-michal.winiarski@intel.com
Similar to regular resizable BARs, VF BARs can also be resized.
The capability layout is the same as PCI_EXT_CAP_ID_REBAR, which means we
can reuse most of the implementation, the only difference being resource
size calculation (which is multiplied by total VFs) and memory decoding
(which is controlled by a separate VF MSE field in SR-IOV cap).
Extend the pci_resize_resource() function to accept IOV resources.
See PCIe r6.2, sec 7.8.7.
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20250702093522.518099-4-michal.winiarski@intel.com
Toggling memory enable is free on bare metal, but potentially expensive
in virtualized environments as the device MMIO spaces are added and
removed from the VM address space, including DMA mapping of those spaces
through the IOMMU where peer-to-peer is supported. Currently memory
decode is disabled around sizing each individual BAR, even for SR-IOV
BARs while VF Enable is cleared.
This can be better optimized for virtual environments by sizing a set
of BARs at once, stashing the resulting mask into an array, while only
toggling memory enable once. This also naturally improves the SR-IOV
path as the caller becomes responsible for any necessary decode disables
while sizing BARs, therefore SR-IOV BARs are sized relying only on the
VF Enable rather than toggling the PF memory enable in the command
register.
Link: https://lore.kernel.org/r/20250120182202.1878581-1-alex.williamson@redhat.com
Reported-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Link: https://lore.kernel.org/r/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
The sysfs link name "virtfn%u" constructed by pci_iov_sysfs_link() requires
17 bytes to contain the longest possible string. Increase VIRTFN_ID_LEN to
accommodate that.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
[bhelgaas: commit log, comment at #define]
Fixes: dd7cc44d0b ("PCI: add SR-IOV API for Physical Function driver")
Link: https://lore.kernel.org/r/20221218033347.23743-1-gremlin@altlinux.org
Signed-off-by: Alexey V. Vissarionov <gremlin@altlinux.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
There are some cases where a SR-IOV VF driver will need to reach into and
interact with the PF driver. This requires accessing the drvdata of the PF.
Provide a function pci_iov_get_pf_drvdata() to return this PF drvdata in a
safe way. Normally accessing a drvdata of a foreign struct device would be
done using the device_lock() to protect against device driver
probe()/remove() races.
However, due to the design of pci_enable_sriov() this will result in a
ABBA deadlock on the device_lock as the PF's device_lock is held during PF
sriov_configure() while calling pci_enable_sriov() which in turn holds the
VF's device_lock while calling VF probe(), and similarly for remove.
This means the VF driver can never obtain the PF's device_lock.
Instead use the implicit locking created by pci_enable/disable_sriov(). A
VF driver can access its PF drvdata only while its own driver is attached,
and the PF driver can control access to its own drvdata based on when it
calls pci_enable/disable_sriov().
To use this API the PF driver will setup the PF drvdata in the probe()
function. pci_enable_sriov() is only called from sriov_configure() which
cannot happen until probe() completes, ensuring no VF races with drvdata
setup.
For removal, the PF driver must call pci_disable_sriov() in its remove
function before destroying any of the drvdata. This ensures that all VF
drivers are unbound before returning, fencing concurrent access to the
drvdata.
The introduction of a new function to do this access makes clear the
special locking scheme and the documents the requirements on the PF/VF
drivers using this.
Link: https://lore.kernel.org/all/20220224142024.147653-5-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
The PCI core uses the VF index internally, often called the vf_id,
during the setup of the VF, eg pci_iov_add_virtfn().
This index is needed for device drivers that implement live migration
for their internal operations that configure/control their VFs.
Specifically, mlx5_vfio_pci driver that is introduced in coming patches
from this series needs it and not the bus/device/function which is
exposed today.
Add pci_iov_vf_id() which computes the vf_id by reversing the math that
was used to create the bus/device/function.
Link: https://lore.kernel.org/all/20220224142024.147653-2-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
This reverts commit 2a4d9408c9.
Robert reported a NULL pointer dereference caused by the PCI core
(local_pci_probe()) calling the i2c_designware_pci driver's
.runtime_resume() method before the .probe() method. i2c_dw_pci_resume()
depends on initialization done by i2c_dw_pci_probe().
Prior to 2a4d9408c9 ("PCI: Use to_pci_driver() instead of
pci_dev->driver"), pci_pm_runtime_resume() avoided calling the
.runtime_resume() method because pci_dev->driver had not been set yet.
2a4d9408c9 and b5f9c644eb ("PCI: Remove struct pci_dev->driver"),
removed pci_dev->driver, replacing it by device->driver, which *has* been
set by this time, so pci_pm_runtime_resume() called the .runtime_resume()
method when it previously had not.
Fixes: 2a4d9408c9 ("PCI: Use to_pci_driver() instead of pci_dev->driver")
Link: https://lore.kernel.org/linux-i2c/CAP145pgdrdiMAT7=-iB1DMgA7t_bMqTcJL4N0=6u8kNY3EU0dw@mail.gmail.com/
Reported-by: Robert Święcki <robert@swiecki.net>
Tested-by: Robert Święcki <robert@swiecki.net>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
- Check for CAP_SYS_ADMIN before validating sysfs user input, not after
(Krzysztof Wilczyński)
- Always return -EINVAL from sysfs "store" functions for invalid user input
instead of -EINVAL sometimes and -ERANGE others (Krzysztof Wilczyński)
- Use kstrtobool() directly instead of the strtobool() wrapper (Krzysztof
Wilczyński)
* pci/sysfs:
PCI: Use kstrtobool() directly, sans strtobool() wrapper
PCI/sysfs: Return -EINVAL consistently from "store" functions
PCI/sysfs: Check CAP_SYS_ADMIN before parsing user input
# Conflicts:
# drivers/pci/iov.c
Struct pci_driver contains a struct device_driver, so for PCI devices, it's
easy to convert a device_driver * to a pci_driver * with to_pci_driver().
The device_driver * is in struct device, so we don't need to also keep
track of the pci_driver * in struct pci_dev.
Replace pci_dev->driver with to_pci_driver(). This is a step toward
removing pci_dev->driver.
[bhelgaas: split to separate patch]
Link: https://lore.kernel.org/r/20211004125935.2300113-11-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Most of the "store" functions that handle userspace input via sysfs return
-EINVAL should the value fail validation and/or type conversion. This
error code is a clear message to userspace that the value is not a valid
input.
However, some of the "show" functions return input parsing error codes
as-is, which may be either -EINVAL or -ERANGE. The former would often be
from kstrtobool(), and the latter typically from other kstr*() functions
such as kstrtou8(), kstrtou32(), kstrtoint(), etc.
-EINVAL is commonly returned as the error code to indicate that the value
provided is invalid, but -ERANGE is not very useful in userspace.
Therefore, normalize the return error code to be -EINVAL for when the
validation and/or type conversion fails.
Link: https://lore.kernel.org/r/20210915230127.2495723-2-kw@linux.com
Signed-off-by: Krzysztof Wilczyński <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The sysfs_emit() and sysfs_emit_at() functions were introduced to make
it less ambiguous which function is preferred when writing to the output
buffer in a device attribute's "show" callback [1].
Convert the PCI sysfs object "show" functions from sprintf(), snprintf()
and scnprintf() to sysfs_emit() and sysfs_emit_at() accordingly, as the
latter is aware of the PAGE_SIZE buffer and correctly returns the number
of bytes written into the buffer.
No functional change intended.
[1] Documentation/filesystems/sysfs.rst
Related commit: ad025f8e46 ("PCI/sysfs: Use sysfs_emit() and
sysfs_emit_at() in "show" functions").
Link: https://lore.kernel.org/r/20210603000112.703037-2-kw@linux.com
Signed-off-by: Krzysztof Wilczyński <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
A typical cloud provider SR-IOV use case is to create many VFs for use by
guest VMs. The VFs may not be assigned to a VM until a customer requests a
VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors
proportional to the number of CPUs in the VM, but there is no standard way
to change the number of MSI-X vectors supported by a VF.
Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors
to SR-IOV VFs. This can be done by the PF driver after VFs are enabled,
and it can be done without affecting VFs that are already in use. The
hardware supports a limited pool of MSI-X vectors that can be assigned to
the PF or to individual VFs. This is device-specific behavior that
requires support in the PF driver.
Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable
"sriov_vf_msix_count" file for each VF. Management software may use these
to learn how many MSI-X vectors are available and to dynamically assign
them to VFs before the VFs are passed through to a VM.
If the PF driver implements the ->sriov_get_vf_total_msix() callback,
"sriov_vf_total_msix" contains the total number of MSI-X vectors available
for distribution among VFs.
If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses
the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X
vectors to the VF. When a VF driver subsequently reads the MSI-X Message
Control register, it will see the new Table Size "N".
Link: https://lore.kernel.org/linux-pci/20210314124256.70253-2-leon@kernel.org
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
For VFs, the Memory Space Enable bit in the Command Register is
hard-wired to 0.
Add a new bit to signify devices where the Command Register Memory
Space Enable bit does not control the device's response to MMIO
accesses.
Fixes: abafbc551f ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
On s390 PCI Virtual Functions (VFs) are scanned by firmware and are made
available to Linux via the hot-plug interface. As such the common code
path of doing the scan directly using the parent Physical Function (PF)
is not used and fenced off with the no_vf_scan attribute.
Even if the partition created the VFs itself e.g. using the sriov_numvfs
attribute of a PF, the PF/VF links thus need to be established after the
fact. To do this when a VF is plugged we scan through all functions on
the same zbus and test whether they are the parent PF in which case we
establish the necessary links.
With these links established there is now no more need to fence off
pci_iov_remove_virtfn() for pdev->no_vf_scan as the common code now
works fine.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/20200506154139.90609-3-schnelle@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently pci_iov_add_virtfn() scans the SR-IOV BARs, adds the VF to the
bus and also creates the sysfs links between the newly added VF and its
parent PF.
With pdev->no_vf_scan fencing off the entire pci_iov_add_virtfn() call
s390 as the sole pdev->no_vf_scan user thus ends up missing these sysfs
links which are required for example by QEMU/libvirt.
Instead of duplicating the code refactor pci_iov_add_virtfn() to make
sysfs link creation callable separately.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/20200506154139.90609-1-schnelle@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When sriov_numvfs is being updated, we call the driver->sriov_configure()
function, which may enable VFs and call probe functions, which may make new
devices visible. This all happens before before sriov_numvfs_store()
updates sriov->num_VFs, so previously, concurrent sysfs reads of
sriov_numvfs returned stale values.
Serialize the sysfs read vs the write so the read returns the correct
num_VFs value.
[bhelgaas: hold device_lock instead of checking mutex_is_locked()]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
- Convert pci_resource_to_user() to a weak function to remove
HAVE_ARCH_PCI_RESOURCE_TO_USER #defines (Denis Efremov)
- Use PCI_SRIOV_NUM_BARS for idiomatic loop structure (Denis Efremov)
- Fix Resizable BAR size suspend/restore for 1MB BARs (Sumit Saxena)
- Correct "pci=resource_alignment" example in documentation (Alexey
Kardashevskiy)
* pci/resource:
PCI: Correct pci=resource_alignment parameter example
PCI: Restore Resizable BAR size bits correctly for 1MB BARs
PCI: Use PCI_SRIOV_NUM_BARS in loops instead of PCI_IOV_RESOURCE_END
PCI: Convert pci_resource_to_user() to a weak function
# Conflicts:
# drivers/pci/pci.c
Revert 975bb8b4dc ("PCI/IOV: Use VF0 cached config space size for other
VFs"), which attempted to cache the config space size from the first VF to
re-use for subsequent VFs.
The cached value was determined prior to discovering the PCIe capability on
the VF, which resulted in the first VF reporting the correct config space
size (4K), as it has a special case through pci_cfg_space_size(), while all
the other VFs only reported 256 bytes. As this was only a performance
optimization, we're better off without it.
Fixes: 975bb8b4dc ("PCI/IOV: Use VF0 cached config space size for other VFs")
Link: https://lore.kernel.org/r/156046663197.29869.3633634445109057665.stgit@gimli.home
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Hao Zheng <yinhe@linux.alibaba.com>
Provide a flag to skip scanning for new VFs after SR-IOV enablement. This
can be set by implementations for which the VFs are already reported by
other means.
Signed-off-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Provide sriov_add_vfs() as a wrapper to scan for VFs that cleans up after
itself. This is just a code simplification. No functional change.
Signed-off-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
iov.c uses nothing declared in <linux/pci-ats.h>, so remove the include of
it. No functional change intended.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cache the config space size from VF0 and use it for all other VFs instead
of reading it from the config space of each VF. We assume that it will be
the same across all associated VFs.
This is an optimization when enabling SR-IOV on a device with many VFs.
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
[bhelgaas: use CONFIG_PCI_IOV (not CONFIG_PCI_ATS)]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
- To avoid bus errors, enable PASID only if entire path supports End-End
TLP prefixes (Sinan Kaya)
- Unify slot and bus reset functions and remove hotplug knowledge from
callers (Sinan Kaya)
- Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
fix guest reboot issues (Alex Williamson)
- Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD Controller
(Bjorn Helgaas)
* pci/virtualization:
PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
PCI: Delay after FLR of Intel DC P3700 NVMe
PCI: Disable Samsung SM961/PM961 NVMe before FLR
PCI: Export pcie_has_flr()
PCI: Rename pci_try_reset_bus() to pci_reset_bus()
PCI: Deprecate pci_reset_bus() and pci_reset_slot() functions
PCI: Unify try slot and bus reset API
PCI: Hide pci_reset_bridge_secondary_bus() from drivers
IB/hfi1: Use pci_try_reset_bus() for initiating PCI Secondary Bus Reset
PCI: Handle error return from pci_reset_bridge_secondary_bus()
PCI/IOV: Tidy pci_sriov_set_totalvfs()
PCI: Enable PASID only if entire path supports End-End TLP prefixes
# Conflicts:
# drivers/pci/hotplug/pciehp_hpc.c
The TotalVFs register in the SR-IOV capability is the hardware limit on the
number of VFs. A PF driver can limit the number of VFs further with
pci_sriov_set_totalvfs(). When the PF driver is removed, reset any VF
limit that was imposed by the driver because that limit may not apply to
other drivers.
Before 8d85a7a4f2 ("PCI/IOV: Allow PF drivers to limit total_VFs to 0"),
pci_sriov_set_totalvfs(pdev, 0) meant "we can enable TotalVFs virtual
functions", and the nfp driver used that to remove the VF limit when the
driver unloads.
8d85a7a4f2 broke that because instead of removing the VF limit,
pci_sriov_set_totalvfs(pdev, 0) actually sets the limit to zero, and that
limit persists even if another driver is loaded.
We could fix that by making the nfp driver reset the limit when it unloads,
but it seems more robust to do it in the PCI core instead of relying on the
driver.
The regression scenario is:
nfp_pci_probe (driver 1)
...
nfp_pci_remove
pci_sriov_set_totalvfs(pf->pdev, 0) # limits VFs to 0
...
nfp_pci_probe (driver 2)
nfp_rtsym_read_le("nfd_vf_cfg_max_vfs")
# no VF limit from firmware
Now driver 2 is broken because the VF limit is still 0 from driver 1.
Fixes: 8d85a7a4f2 ("PCI/IOV: Allow PF drivers to limit total_VFs to 0")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
[bhelgaas: changelog, rename functions]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Some SR-IOV PF drivers implement .sriov_configure(), which allows
user-space to enable VFs by writing the desired number of VFs to the sysfs
"sriov_numvfs" file (see sriov_numvfs_store()).
The PCI core limits the number of VFs to the TotalVFs advertised by the
device in its SR-IOV capability. The PF driver can limit the number of VFs
to even fewer (it may have pre-allocated data structures or knowledge of
device limitations) by calling pci_sriov_set_totalvfs(), but previously it
could not limit the VFs to 0.
Change pci_sriov_get_totalvfs() so it always respects the VF limit imposed
by the PF driver, even if the limit is 0.
This sequence:
pci_sriov_set_totalvfs(dev, 0);
x = pci_sriov_get_totalvfs(dev);
previously set "x" to TotalVFs from the SR-IOV capability. Now it will set
"x" to 0.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
[bhelgaas: split to separate patch]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
SR-IOV (Single Root I/O Virtualization) is an optional PCIe capability (see
PCIe r4.0, sec 9). A PCIe Function with the SR-IOV capability is referred
to as a PF (Physical Function). If SR-IOV is enabled on the PF, several
VFs (Virtual Functions) may be created. The VFs can be individually
assigned to virtual machines, which allows them to share a single hardware
device while being isolated from each other.
Some SR-IOV devices have resources such as queues and interrupts that must
be set up in the PF before enabling the VFs, so they require a PF driver to
do that.
Other SR-IOV devices don't require any PF setup before enabling VFs. Add a
pci_sriov_configure_simple() interface so PF drivers for such devices can
use it without repeating the VF-enabling code.
Tested-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
[bhelgaas: changelog, comment]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>:wq
- probe for device reset support during enumeration instead of runtime
(Bjorn Helgaas)
- add ACS quirk for Ampere (née APM) root ports (Feng Kan)
- add function 1 DMA alias quirk for Marvell 88SE9220 (Thomas
Vincent-Cross)
- protect device restore with device lock (Sinan Kaya)
- handle failure of FLR gracefully (Sinan Kaya)
- handle CRS (config retry status) after device resets (Sinan Kaya)
- skip various config reads for SR-IOV VFs as an optimization (KarimAllah
Ahmed)
* pci/virtualization:
PCI/IOV: Add missing prototypes for powerpc pcibios interfaces
PCI/IOV: Use VF0 cached config registers for other VFs
PCI/IOV: Skip BAR sizing for VFs
PCI/IOV: Skip INTx config reads for VFs
PCI: Wait for device to become ready after secondary bus reset
PCI: Add a return type for pci_reset_bridge_secondary_bus()
PCI: Wait for device to become ready after a power management reset
PCI: Rename pci_flr_wait() to pci_dev_wait() and make it generic
PCI: Handle FLR failure and allow other reset types
PCI: Protect restore with device lock to be consistent
PCI: Add function 1 DMA alias quirk for Marvell 88SE9220
PCI: Add ACS quirk for Ampere root ports
PCI: Remove redundant probes for device reset support
PCI: Probe for device reset support during enumeration
Conflicts:
include/linux/pci.h
Cache some config data from VF0 and use it for all other VFs instead of
reading it from the config space of each VF. We assume these items are the
same across all associated VFs:
Revision ID
Class Code
Subsystem Vendor ID
Subsystem ID
This is an optimization when enabling SR-IOV on a device with many VFs.
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
[bhelgaas: changelog, simplify comments, remove unused "device", test
CONFIG_PCI_IOV instead of CONFIG_PCI_ATS, rename functions]
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>