713 Commits

Author SHA1 Message Date
Thomas Gleixner
0edc78b82b x86/msi: Make irq_retrigger() functional for posted MSI
Luigi reported that retriggering a posted MSI interrupt does not work
correctly.

The reason is that the retrigger happens at the vector domain by sending an
IPI to the actual vector on the target CPU. That works correctly exactly
once because the posted MSI interrupt chip does not issue an EOI as that's
only required for the posted MSI notification vector itself.

As a consequence the vector becomes stale in the ISR, which not only
affects this vector but also any lower priority vector in the affected
APIC because the ISR bit is not cleared.

Luigi proposed to set the vector in the remap PIR bitmap and raise the
posted MSI notification vector. That works, but that still does not cure a
related problem:

  If there is ever a stray interrupt on such a vector, then the related
  APIC ISR bit becomes stale due to the lack of EOI as described above.
  Unlikely to happen, but if it happens it's not debuggable at all.

So instead of playing games with the PIR, this can be actually solved
for both cases by:

 1) Keeping track of the posted interrupt vector handler state

 2) Implementing a posted MSI specific irq_ack() callback which checks that
    state. If the posted vector handler is inactive it issues an EOI,
    otherwise it delegates that to the posted handler.

This is correct versus affinity changes and concurrent events on the posted
vector as the actual handler invocation is serialized through the interrupt
descriptor lock.

Fixes: ed1e48ea43 ("iommu/vt-d: Enable posted mode for device MSIs")
Reported-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Luigi Rizzo <lrizzo@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251125214631.044440658@linutronix.de
Closes: https://lore.kernel.org/lkml/20251124104836.3685533-1-lrizzo@google.com
2025-12-17 18:41:52 +01:00
Linus Torvalds
208eed95fc Merge tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
 "This is the first half of the driver changes:

   - A treewide interface change to the "syscore" operations for power
     management, as a preparation for future Tegra specific changes

   - Reset controller updates with added drivers for LAN969x, eic770 and
     RZ/G3S SoCs

   - Protection of system controller registers on Renesas and Google
     SoCs, to prevent trivially triggering a system crash from e.g.
     debugfs access

   - soc_device identification updates on Nvidia, Exynos and Mediatek

   - debugfs support in the ST STM32 firewall driver

   - Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI

   - Cleanups for memory controller support on Nvidia and Renesas"

* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
  memory: tegra186-emc: Fix missing put_bpmp
  Documentation: reset: Remove reset_controller_add_lookup()
  reset: fix BIT macro reference
  reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
  reset: th1520: Support reset controllers in more subsystems
  reset: th1520: Prepare for supporting multiple controllers
  dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
  dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
  reset: remove legacy reset lookup code
  clk: davinci: psc: drop unused reset lookup
  reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
  reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
  dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
  reset: eswin: Add eic7700 reset driver
  dt-bindings: reset: eswin: Documentation for eic7700 SoC
  reset: sparx5: add LAN969x support
  dt-bindings: reset: microchip: Add LAN969x support
  soc: rockchip: grf: Add select correct PWM implementation on RK3368
  soc/tegra: pmc: Add USB wake events for Tegra234
  amba: tegra-ahb: Fix device leak on SMMU enable
  ...
2025-12-05 17:29:04 -08:00
Joerg Roedel
0d081b1694 Merge branches 'arm/smmu/updates', 'arm/smmu/bindings', 'mediatek', 'nvidia/tegra', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2025-11-28 08:44:21 +01:00
Jason Gunthorpe
1eb0ae6fbd iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
If the IOVA is limited to less than 48 the page table will be constructed
with a 3 level configuration which is unsupported by hardware.

Like the second stage the caller needs to pass in both the top_level an
the vasz to specify a table that has more levels than required to hold the
IOVA range.

Fixes: 6cbc09b771 ("iommu/vt-d: Restore previous domain::aperture_end calculation")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28 08:43:55 +01:00
Jason Gunthorpe
d856f9d278 iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
VT-d second stage HW specifies both the maximum IOVA and the supported
table walk starting points. Weirdly there is HW that only supports a 4
level walk but has a maximum IOVA that only needs 3.

The current code miscalculates this and creates a wrongly sized page table
which ultimately fails the compatibility check for number of levels.

This is fixed by allowing the page table to be created with both a vasz
and top_level input. The vasz will set the aperture for the domain while
the top_level will set the page table geometry.

Add top_level to vtdss and correct the logic in VT-d to generate the right
top_level and vasz from mgaw and sagaw.

Fixes: d373449d8e ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28 08:43:03 +01:00
Lu Baolu
6cbc09b771 iommu/vt-d: Restore previous domain::aperture_end calculation
Commit d373449d8e ("iommu/vt-d: Use the generic iommu page table")
changed the calculation of domain::aperture_end. Previously, it was
calculated as:

    domain->domain.geometry.aperture_end =
            __DOMAIN_MAX_ADDR(domain->gaw - 1);

where domain->gaw was limited to less than MGAW.

Currently, it is calculated purely based on the max level of the page
table that the hardware supports. This is incorrect as stated in Section
3.6 of the VT-d spec:

  "Software using first-stage translation structures to translate an IO
   Virtual Address (IOVA) must use canonical addresses. Additionally,
   software must limit addresses to less than the minimum of MGAW and the
   lower canonical address width implied by FSPM (i.e., 47-bit when FSPM
   is 4-level and 56-bit when FSPM is 5-level)."

Restore the previous calculation method for domain::aperture_end to avoid
violating the spec. Incorrect aperture calculation causes GPU hangs
without generating VT-d faults on some Intel client platforms.

Fixes: d373449d8e ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/4f15cf3b-6fad-4cd8-87e5-6d86c0082673@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20 11:35:15 +01:00
Aashish Sharma
6b38a108ee iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlb
Invalidation hint (ih) in the function 'qi_desc_iotlb' is initialized
to zero and never used. It is embedded in the 0th bit of the 'addr'
parameter. Get the correct 'ih' value from there.

Fixes: f701c9f36b ("iommu/vt-d: Factor out invalidation descriptor composition")
Signed-off-by: Aashish Sharma <aashish@aashishsharma.net>
Link: https://lore.kernel.org/r/20251009010903.1323979-1-aashish@aashishsharma.net
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20 11:33:04 +01:00
Vineeth Pillai (Google)
cb3db5a39e iommu/vt-d: Set INTEL_IOMMU_FLOPPY_WA depend on BLK_DEV_FD
INTEL_IOMMU_FLOPPY_WA workaround was introduced to create direct mappings
for first 16MB for floppy devices as the floppy drivers were not using
dma apis. We need not do this direct map if floppy driver is not
enabled.

INTEL_IOMMU_FLOPPY_WA is generally not a good idea. Iommu will be
mapping pages in this address range while kernel would also be
allocating from this range(mostly on memory stress). A misbehaving
device using this domain will have access to the pages that the
kernel might be actively using. We noticed this while running a test
that was trying to figure out if any pages used by kernel is in iommu
page tables.

This patch reduces the scope of the above issue by disabling the
workaround when floppy driver is not enabled. But we would still need to
fix the floppy driver to use dma apis so that we need not do direct map
without reserving the pages. Or the other option is to reserve this
memory range in firmware so that kernel will not use the pages.

Fixes: d850c2ee5f ("iommu/vt-d: Expose ISA direct mapping region via iommu_get_resv_regions")
Fixes: 49a0429e53 ("Intel IOMMU: Iommu floppy workaround")
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Link: https://lore.kernel.org/r/20251002161625.1155133-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20 11:33:04 +01:00
Thierry Reding
a97fbc3ee3 syscore: Pass context data to callbacks
Several drivers can benefit from registering per-instance data along
with the syscore operations. To achieve this, move the modifiable fields
out of the syscore_ops structure and into a separate struct syscore that
can be registered with the framework. Add a void * driver data field for
drivers to store contextual data that will be passed to the syscore ops.

Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
2025-11-14 10:01:52 +01:00
Jason Gunthorpe
101a285411 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
Currently a incoherent walk domain cannot be attached to a coherent
capable iommu. Kevin says HW probably doesn't exist with such a mixture,
but making the driver support it makes logical sense anyhow.

When building the PASID entry the PWSNP (Page Walk Snoop) bit tells the HW
if it should issue snoops. If the page table is cache flushed because of
PT_FEAT_DMA_INCOHERENT then it is fine to set this bit to 0 even if the HW
supports 1.

Weaken the compatible check to permit a coherent instance to accept an
incoherent table and fix the PASID table construction to set PWSNP from
PT_FEAT_DMA_INCOHERENT.

SVA always sets PWSNP.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:20 +01:00
Jason Gunthorpe
d373449d8e iommu/vt-d: Use the generic iommu page table
Replace the VT-d iommu_domain implementation of the VT-d second stage and
first stage page tables with the iommupt VTDSS and x86_64
pagetables. x86_64 is shared with the AMD driver.

There are a couple notable things in VT-d:
- Like AMD the second stage format is not sign extended, unlike AMD it
  cannot decode a full 64 bits. The first stage format is a normal sign
  extended x86 page table
- The HW caps can indicate how many levels, how many address bits and what
  leaf page sizes are supported in HW. As before the highest number of
  levels that can translate the entire supported address width is used.
  The supported page sizes are adjusted directly from the dedicated
  first/second stage cap bits.
- VTD requires flushing 'write buffers'. This logic is left unchanged,
  the write buffer flushes on any gather flush or through iotlb_sync_map.
- Like ARM, VTD has an optional non-coherent page table walker that
  requires cache flushing. This is supported through PT_FEAT_DMA_INCOHERENT
  the same as ARM, however x86 can't use the DMA API for flush, it must
  call the arch function clflush_cache_range()
- The PT_FEAT_DYNAMIC_TOP can probably be supported on VT-d someday for the
  second stage when it uses 128 bit atomic stores for the HW context
  structures.
- PT_FEAT_VTDSS_FORCE_WRITEABLE is used to work around ERRATA_772415_SPR17
- A kernel command line parameter "sp_off" disables all page sizes except
  4k

Remove all the unused iommu_domain page table code. The debugfs paths have
their own independent page table walker that is left alone for now.

This corrects a race with the non-coherent walker that the ARM
implementations have fixed:

     CPU 0                               CPU 1
  pfn_to_dma_pte()                    pfn_to_dma_pte()
   pte = &parent[offset];
   if (!dma_pte_present(pte)) {
     try_cmpxchg64(&pte->val)
					pte = &parent[offset];
					.. dma_pte_present(pte) ..
				        [...]
					// iommu_map() completes
					// Device does DMA
     domain_flush_cache(pte)

The CPU 1 mapping operation shares a page table level with the CPU 0
mapping operation. CPU 0 installed a new page table level but has not
flushed it yet. CPU1 returns from iommu_map() and the device does DMA. The
non coherent walker fails to see the new table level installed by CPU 0
and fails the DMA with non-present.

The iommupt PT_FEAT_DMA_INCOHERENT implementation uses the ARM design of
storing a flag when CPU 0 completes the flush. If the flag is not set CPU
1 will also flush to ensure the HW can fully walk to the PTE being
installed.

Cc: Tina Zhang <tina.zhang@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:19 +01:00
Nicolin Chen
fd714986e4 iommu: Pass in old domain to attach_dev callback functions
The IOMMU core attaches each device to a default domain on probe(). Then,
every new "attach" operation has a fundamental meaning of two-fold:
 - detach from its currently attached (old) domain
 - attach to a given new domain

Modern IOMMU drivers following this pattern usually want to clean up the
things related to the old domain, so they call iommu_get_domain_for_dev()
to fetch the old domain.

Pass in the old domain pointer from the core to drivers, aligning with the
set_dev_pasid op that does so already.

Ensure all low-level attach fcuntions in the core can forward the correct
old domain pointer. Thus, rework those functions as well.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:35 +01:00
Joerg Roedel
5f4b8c03f4 Merge branches 'apple/dart', 'ti/omap', 'riscv', 'intel/vt-d' and 'amd/amd-vi' into next 2025-09-26 10:03:33 +02:00
Lu Baolu
57f55048e5 iommu/vt-d: Disallow dirty tracking if incoherent page walk
Dirty page tracking relies on the IOMMU atomically updating the dirty bit
in the paging-structure entry. For this operation to succeed, the paging-
structure memory must be coherent between the IOMMU and the CPU. In
another word, if the iommu page walk is incoherent, dirty page tracking
doesn't work.

The Intel VT-d specification, Section 3.10 "Snoop Behavior" states:

"Remapping hardware encountering the need to atomically update A/EA/D bits
 in a paging-structure entry that is not snooped will result in a non-
 recoverable fault."

To prevent an IOMMU from being incorrectly configured for dirty page
tracking when it is operating in an incoherent mode, mark SSADS as
supported only when both ecap_slads and ecap_smpwc are supported.

Fixes: f35f22cc76 ("iommu/vt-d: Access/Dirty bit support for SS domains")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250924083447.123224-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-26 10:02:26 +02:00
Lu Baolu
5bd5ab53e7 iommu/vt-d: debugfs: Avoid dumping context command register
The register-based cache invalidation interface is in the process of being
replaced by the queued invalidation interface. The VT-d architecture
allows hardware implementations with a queued invalidation interface to
not implement the registers used for cache invalidation. Currently, the
debugfs interface dumps the Context Command Register unconditionally,
which is not reasonable.

Remove it to avoid potential access to non-present registers.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250917025051.143853-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19 09:43:21 +02:00
Lu Baolu
7d4e404102 iommu/vt-d: Removal of Advanced Fault Logging
The advanced fault logging has been removed from the specification since
v4.0. Linux doesn't implement advanced fault logging functionality, but
it currently dumps the advanced logging registers through debugfs. Remove
the dumping of these advanced fault logging registers through debugfs to
avoid potential access to non-present registers.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250917024850.143801-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19 09:43:21 +02:00
Lu Baolu
5ef7e24c74 iommu/vt-d: PRS isn't usable if PDS isn't supported
The specification, Section 7.10, "Software Steps to Drain Page Requests &
Responses," requires software to submit an Invalidation Wait Descriptor
(inv_wait_dsc) with the Page-request Drain (PD=1) flag set, along with
the Invalidation Wait Completion Status Write flag (SW=1). It then waits
for the Invalidation Wait Descriptor's completion.

However, the PD field in the Invalidation Wait Descriptor is optional, as
stated in Section 6.5.2.9, "Invalidation Wait Descriptor":

"Page-request Drain (PD): Remapping hardware implementations reporting
 Page-request draining as not supported (PDS = 0 in ECAP_REG) treat this
 field as reserved."

This implies that if the IOMMU doesn't support the PDS capability, software
can't drain page requests and group responses as expected.

Do not enable PCI/PRI if the IOMMU doesn't support PDS.

Reported-by: Joel Granados <joel.granados@kernel.org>
Closes: https://lore.kernel.org/r/20250909-jag-pds-v1-1-ad8cba0e494e@kernel.org
Fixes: 66ac4db36f ("iommu/vt-d: Add page request draining support")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250915062946.120196-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19 09:43:21 +02:00
Lu Baolu
4402e8f39d iommu/vt-d: Remove LPIG from page group response descriptor
Bit 66 in the page group response descriptor used to be the LPIG (Last
Page in Group), but it was marked as Reserved since Specification 4.0.
Remove programming on this bit to make it consistent with the latest
specification.

Existing hardware all treats bit 66 of the page group response descriptor
as "ignored", therefore this change doesn't break any existing hardware.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250901053943.1708490-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19 09:43:20 +02:00
Yury Norov (NVIDIA)
4c48101364 iommu/vt-d: Drop unused cap_super_offset()
The macro is unused. Drop the dead code.

Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20250913015024.81186-1-yury.norov@gmail.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19 09:43:20 +02:00
Vineeth Pillai (Google)
fbe6070c73 iommu/vt-d: debugfs: Fix legacy mode page table dump logic
In legacy mode, SSPTPTR is ignored if TT is not 00b or 01b. SSPTPTR
maybe uninitialized or zero in that case and may cause oops like:

 Oops: general protection fault, probably for non-canonical address
       0xf00087d3f000f000: 0000 [#1] SMP NOPTI
 CPU: 2 UID: 0 PID: 786 Comm: cat Not tainted 6.16.0 #191 PREEMPT(voluntary)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
 RIP: 0010:pgtable_walk_level+0x98/0x150
 RSP: 0018:ffffc90000f279c0 EFLAGS: 00010206
 RAX: 0000000040000000 RBX: ffffc90000f27ab0 RCX: 000000000000001e
 RDX: 0000000000000003 RSI: f00087d3f000f000 RDI: f00087d3f0010000
 RBP: ffffc90000f27a00 R08: ffffc90000f27a98 R09: 0000000000000002
 R10: 0000000000000000 R11: 0000000000000000 R12: f00087d3f000f000
 R13: 0000000000000000 R14: 0000000040000000 R15: ffffc90000f27a98
 FS:  0000764566dcb740(0000) GS:ffff8881f812c000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000764566d44000 CR3: 0000000109d81003 CR4: 0000000000772ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  pgtable_walk_level+0x88/0x150
  domain_translation_struct_show.isra.0+0x2d9/0x300
  dev_domain_translation_struct_show+0x20/0x40
  seq_read_iter+0x12d/0x490
...

Avoid walking the page table if TT is not 00b or 01b.

Fixes: 2b437e8045 ("iommu/vt-d: debugfs: Support dumping a specified page table")
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250814163153.634680-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19 09:43:20 +02:00
Seyediman Seyedarab
75c02a0376 iommu/vt-d: Replace snprintf with scnprintf in dmar_latency_snapshot()
snprintf() returns the number of bytes that would have been written, not
the number actually written. Using this for offset tracking can cause
buffer overruns if truncation occurs.

Replace snprintf() with scnprintf() to ensure the offset stays within
bounds.

Since scnprintf() never returns a negative value, and zero is not possible
in this context because 'bytes' starts at 0 and 'size - bytes' is
DEBUG_BUFFER_SIZE in the first call, which is large enough to hold the
string literals used, the return value is always positive. An integer
overflow is also completely out of reach here due to the small and fixed
buffer size. The error check in latency_show_one() is therefore
unnecessary. Remove it and make dmar_latency_snapshot() return void.

Signed-off-by: Seyediman Seyedarab <ImanDevel@gmail.com>
Link: https://lore.kernel.org/r/20250731225048.131364-1-ImanDevel@gmail.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-19 09:43:19 +02:00
Eugene Koira
dce043c07c iommu/vt-d: Fix __domain_mapping()'s usage of switch_to_super_page()
switch_to_super_page() assumes the memory range it's working on is aligned
to the target large page level. Unfortunately, __domain_mapping() doesn't
take this into account when using it, and will pass unaligned ranges
ultimately freeing a PTE range larger than expected.

Take for example a mapping with the following iov_pfn range [0x3fe400,
0x4c0600), which should be backed by the following mappings:

   iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages
   iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages
   iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages

Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff]
to switch_to_super_page() at a 1 GiB granularity, which will in turn
free PTEs all the way to iov_pfn 0x4fffff.

Mitigate this by rounding down the iov_pfn range passed to
switch_to_super_page() in __domain_mapping()
to the target large page level.

Additionally add range alignment checks to switch_to_super_page.

Fixes: 9906b9352a ("iommu/vt-d: Avoid duplicate removing in __domain_mapping()")
Signed-off-by: Eugene Koira <eugkoira@amazon.com>
Cc: stable@vger.kernel.org
Reviewed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250826143816.38686-1-eugkoira@amazon.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-09-05 15:11:08 +02:00
Linus Torvalds
0bd0a41a51 Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI updates from Bjorn Helgaas:
 "Enumeration:

   - Allow built-in drivers, not just modular drivers, to use async
     initial probing (Lukas Wunner)

   - Support Immediate Readiness even on devices with no PM Capability
     (Sean Christopherson)

   - Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
     required delay between a reset and sending config requests to a
     device (Niklas Cassel)

   - Add pci_is_display() to check for "Display" base class and use it
     in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)

   - Allow 'isolated PCI functions' (multi-function devices without a
     function 0) for LoongArch, similar to s390 and jailhouse (Huacai
     Chen)

  Power control:

   - Add ability to enable optional slot clock for cases where the PCIe
     host controller and the slot are supplied by different clocks
     (Marek Vasut)

  PCIe native device hotplug:

   - Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
     misinterpreting a config read failure after a device has been
     removed (Lukas Wunner)

   - Avoid creating a useless PCIe port service device for pciehp if the
     slot is handled by the ACPI hotplug driver (Lukas Wunner)

   - Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
     ports (Lukas Wunner)

  Virtualization:

   - Save VF resizable BAR state and restore it after reset (Michał
     Winiarski)

   - Allow IOV resources (VF BARs) to be resized (Michał Winiarski)

   - Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
     (Michał Winiarski)

  Endpoint framework:

   - Add RC-to-EP doorbell support using platform MSI controller,
     including a test case (Frank Li)

   - Allow BAR assignment via configfs so platforms have flexibility in
     determining BAR usage (Jerome Brunet)

  Native PCIe controller drivers:

   - Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
     axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
     DT schema format (Rob Herring)

   - Use dev_fwnode() instead of of_fwnode_handle() to remove OF
     dependency in altera (fixes an unused variable), designware-host,
     mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
     xilinx-nwl (Jiri Slaby, Arnd Bergmann)

   - Convert aardvark, altera, brcmstb, designware-host, iproc,
     mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
     xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
     using msi_create_parent_irq_domain() instead; this makes the
     interrupt controller per-PCI device, allows dynamic allocation of
     vectors after initialization, and allows support of IMS (Nam Cao)

  APM X-Gene PCIe controller driver:

   - Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug
     bits, use device-managed memory allocations, and clean things up
     (Marc Zyngier)

   - Probe xgene-msi as a standard platform driver rather than a
     subsys_initcall (Marc Zyngier)

  Broadcom STB PCIe controller driver:

   - Add optional DT 'num-lanes' property and if present, use it to
     override the Maximum Link Width advertised in Link Capabilities
     (Jim Quinlan)

  Cadence PCIe controller driver:

   - Use PCIe Message routing types from the PCI core rather than
     defining private ones (Hans Zhang)

  Freescale i.MX6 PCIe controller driver:

   - Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)

   - Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
     (Richard Zhu)

   - Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
     trigger doorbel on Endpoint (Frank Li)

   - Remove apps_reset (LTSSM_EN) from
     imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
     regression on i.MX8MM (Richard Zhu)

   - Delay Endpoint link start until configfs 'start' written (Richard
     Zhu)

  Intel VMD host bridge driver:

   - Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)

  Qualcomm PCIe controller driver:

   - Add DT binding and driver support for SA8255p, which supports ECAM
     for Configuration Space access (Mayank Rana)

   - Update DT binding and driver to describe PHYs and per-Root Port
     resets in a Root Port stanza and deprecate describing them in the
     host bridge; this makes it possible to support multiple Root Ports
     in the future (Krishna Chaitanya Chundru)

   - Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)

   - Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)

   - Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
     bindings (Konrad Dybcio)

   - Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue
     Zhang)

   - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
     (Niklas Cassel)

  Rockchip PCIe controller driver:

   - Drop unused PCIe Message routing and code definitions (Hans Zhang)

   - Remove several unused header includes (Hans Zhang)

   - Use standard PCIe config register definitions instead of
     rockchip-specific redefinitions (Geraldo Nascimento)

   - Set Target Link Speed to 5.0 GT/s before retraining so we have a
     chance to train at a higher speed (Geraldo Nascimento)

  Rockchip DesignWare PCIe controller driver:

   - Prevent race between link training and register update via DBI by
     inhibiting link training after hot reset and link down (Wilfred
     Mallawa)

   - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
     (Niklas Cassel)

  Sophgo PCIe controller driver:

   - Add DT binding and driver for Sophgo SG2044 PCIe controller driver
     in Root Complex mode (Inochi Amaoto)

  Synopsys DesignWare PCIe controller driver:

   - Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
     Ports that support > 5.0 GT/s. Slower Ports still rely on the
     not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
     waiting for the Link (Niklas Cassel)"

* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
  dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
  dt-bindings: PCI: Remove 83xx-512x-pci.txt
  dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
  dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
  dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
  dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
  dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
  PCI: Move is_pciehp check out of pciehp_is_native()
  PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
  PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
  PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
  selftests: pci_endpoint: Add doorbell test case
  misc: pci_endpoint_test: Add doorbell test case
  PCI: endpoint: pci-epf-test: Add doorbell test support
  PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
  PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
  PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
  PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
  PCI: vmd: Switch to msi_create_parent_irq_domain()
  PCI: vmd: Convert to lock guards
  ...
2025-08-01 13:59:07 -07:00
Linus Torvalds
c93529ad4f Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
 "This broadly brings the assigned HW command queue support to iommufd.
  This feature is used to improve SVA performance in VMs by avoiding
  paravirtualization traps during SVA invalidations.

  Along the way I think some of the core logic is in a much better state
  to support future driver backed features.

  Summary:

   - IOMMU HW now has features to directly assign HW command queues to a
     guest VM. In this mode the command queue operates on a limited set
     of invalidation commands that are suitable for improving guest
     invalidation performance and easy for the HW to virtualize.

     This brings the generic infrastructure to allow IOMMU drivers to
     expose such command queues through the iommufd uAPI, mmap the
     doorbell pages, and get the guest physical range for the command
     queue ring itself.

   - An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built
     on the new iommufd command queue features. It works with the
     existing SMMU driver support for cmdqv in guest VMs.

   - Many precursor cleanups and improvements to support the above
     cleanly, changes to the general ioctl and object helpers, driver
     support for VDEVICE, and mmap pgoff cookie infrastructure.

   - Sequence VDEVICE destruction to always happen before VFIO device
     destruction. When using the above type features, and also in future
     confidential compute, the internal virtual device representation
     becomes linked to HW or CC TSM configuration and objects. If a VFIO
     device is removed from iommufd those HW objects should also be
     cleaned up to prevent a sort of UAF. This became important now that
     we have HW backing the VDEVICE.

   - Fix one syzkaller found error related to math overflows during iova
     allocation"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (57 commits)
  iommu/arm-smmu-v3: Replace vsmmu_size/type with get_viommu_size
  iommu/arm-smmu-v3: Do not bother impl_ops if IOMMU_VIOMMU_TYPE_ARM_SMMUV3
  iommufd: Rename some shortterm-related identifiers
  iommufd/selftest: Add coverage for vdevice tombstone
  iommufd/selftest: Explicitly skip tests for inapplicable variant
  iommufd/vdevice: Remove struct device reference from struct vdevice
  iommufd: Destroy vdevice on idevice destroy
  iommufd: Add a pre_destroy() op for objects
  iommufd: Add iommufd_object_tombstone_user() helper
  iommufd/viommu: Roll back to use iommufd_object_alloc() for vdevice
  iommufd/selftest: Test reserved regions near ULONG_MAX
  iommufd: Prevent ALIGN() overflow
  iommu/tegra241-cmdqv: import IOMMUFD module namespace
  iommufd: Do not allow _iommufd_object_alloc_ucmd if abort op is set
  iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
  iommu/tegra241-cmdqv: Add user-space use support
  iommu/tegra241-cmdqv: Do not statically map LVCMDQs
  iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
  iommu/tegra241-cmdqv: Use request_threaded_irq
  iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
  ...
2025-07-31 12:43:08 -07:00
Linus Torvalds
63eb28bb14 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Host driver for GICv5, the next generation interrupt controller for
     arm64, including support for interrupt routing, MSIs, interrupt
     translation and wired interrupts

   - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
     GICv5 hardware, leveraging the legacy VGIC interface

   - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
     userspace to disable support for SGIs w/o an active state on
     hardware that previously advertised it unconditionally

   - Map supporting endpoints with cacheable memory attributes on
     systems with FEAT_S2FWB and DIC where KVM no longer needs to
     perform cache maintenance on the address range

   - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
     guest hypervisor to inject external aborts into an L2 VM and take
     traps of masked external aborts to the hypervisor

   - Convert more system register sanitization to the config-driven
     implementation

   - Fixes to the visibility of EL2 registers, namely making VGICv3
     system registers accessible through the VGIC device instead of the
     ONE_REG vCPU ioctls

   - Various cleanups and minor fixes

  LoongArch:

   - Add stat information for in-kernel irqchip

   - Add tracepoints for CPUCFG and CSR emulation exits

   - Enhance in-kernel irqchip emulation

   - Various cleanups

  RISC-V:

   - Enable ring-based dirty memory tracking

   - Improve perf kvm stat to report interrupt events

   - Delegate illegal instruction trap to VS-mode

   - MMU improvements related to upcoming nested virtualization

  s390x

   - Fixes

  x86:

   - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
     APIC, PIC, and PIT emulation at compile time

   - Share device posted IRQ code between SVM and VMX and harden it
     against bugs and runtime errors

   - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
     O(1) instead of O(n)

   - For MMIO stale data mitigation, track whether or not a vCPU has
     access to (host) MMIO based on whether the page tables have MMIO
     pfns mapped; using VFIO is prone to false negatives

   - Rework the MSR interception code so that the SVM and VMX APIs are
     more or less identical

   - Recalculate all MSR intercepts from scratch on MSR filter changes,
     instead of maintaining shadow bitmaps

   - Advertise support for LKGS (Load Kernel GS base), a new instruction
     that's loosely related to FRED, but is supported and enumerated
     independently

   - Fix a user-triggerable WARN that syzkaller found by setting the
     vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
     the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
     possible path leading to architecturally forbidden states is hard
     and even risks breaking userspace (if it goes from valid to valid
     state but passes through invalid states), so just wait until
     KVM_RUN to detect that the vCPU state isn't allowed

   - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
     interception of APERF/MPERF reads, so that a "properly" configured
     VM can access APERF/MPERF. This has many caveats (APERF/MPERF
     cannot be zeroed on vCPU creation or saved/restored on suspend and
     resume, or preserved over thread migration let alone VM migration)
     but can be useful whenever you're interested in letting Linux
     guests see the effective physical CPU frequency in /proc/cpuinfo

   - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
     created, as there's no known use case for changing the default
     frequency for other VM types and it goes counter to the very reason
     why the ioctl was added to the vm file descriptor. And also, there
     would be no way to make it work for confidential VMs with a
     "secure" TSC, so kill two birds with one stone

   - Dynamically allocation the shadow MMU's hashed page list, and defer
     allocating the hashed list until it's actually needed (the TDP MMU
     doesn't use the list)

   - Extract many of KVM's helpers for accessing architectural local
     APIC state to common x86 so that they can be shared by guest-side
     code for Secure AVIC

   - Various cleanups and fixes

  x86 (Intel):

   - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
     Failure to honor FREEZE_IN_SMM can leak host state into guests

   - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
     prevent L1 from running L2 with features that KVM doesn't support,
     e.g. BTF

  x86 (AMD):

   - WARN and reject loading kvm-amd.ko instead of panicking the kernel
     if the nested SVM MSRPM offsets tracker can't handle an MSR (which
     is pretty much a static condition and therefore should never
     happen, but still)

   - Fix a variety of flaws and bugs in the AVIC device posted IRQ code

   - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
     supports) instead of rejecting vCPU creation

   - Extend enable_ipiv module param support to SVM, by simply leaving
     IsRunning clear in the vCPU's physical ID table entry

   - Disable IPI virtualization, via enable_ipiv, if the CPU is affected
     by erratum #1235, to allow (safely) enabling AVIC on such CPUs

   - Request GA Log interrupts if and only if the target vCPU is
     blocking, i.e. only if KVM needs a notification in order to wake
     the vCPU

   - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
     the vCPU's CPUID model

   - Accept any SNP policy that is accepted by the firmware with respect
     to SMT and single-socket restrictions. An incompatible policy
     doesn't put the kernel at risk in any way, so there's no reason for
     KVM to care

   - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
     use WBNOINVD instead of WBINVD when possible for SEV cache
     maintenance

   - When reclaiming memory from an SEV guest, only do cache flushes on
     CPUs that have ever run a vCPU for the guest, i.e. don't flush the
     caches for CPUs that can't possibly have cache lines with dirty,
     encrypted data

  Generic:

   - Rework irqbypass to track/match producers and consumers via an
     xarray instead of a linked list. Using a linked list leads to
     O(n^2) insertion times, which is hugely problematic for use cases
     that create large numbers of VMs. Such use cases typically don't
     actually use irqbypass, but eliminating the pointless registration
     is a future problem to solve as it likely requires new uAPI

   - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
     "void *", to avoid making a simple concept unnecessarily difficult
     to understand

   - Decouple device posted IRQs from VFIO device assignment, as binding
     a VM to a VFIO group is not a requirement for enabling device
     posted IRQs

   - Clean up and document/comment the irqfd assignment code

   - Disallow binding multiple irqfds to an eventfd with a priority
     waiter, i.e. ensure an eventfd is bound to at most one irqfd
     through the entire host, and add a selftest to verify eventfd:irqfd
     bindings are globally unique

   - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
     related to private <=> shared memory conversions

   - Drop guest_memfd's .getattr() implementation as the VFS layer will
     call generic_fillattr() if inode_operations.getattr is NULL

   - Fix issues with dirty ring harvesting where KVM doesn't bound the
     processing of entries in any way, which allows userspace to keep
     KVM in a tight loop indefinitely

   - Kill off kvm_arch_{start,end}_assignment() and x86's associated
     tracking, now that KVM no longer uses assigned_device_count as a
     heuristic for either irqbypass usage or MDS mitigation

  Selftests:

   - Fix a comment typo

   - Verify KVM is loaded when getting any KVM module param so that
     attempting to run a selftest without kvm.ko loaded results in a
     SKIP message about KVM not being loaded/enabled (versus some random
     parameter not existing)

   - Skip tests that hit EACCES when attempting to access a file, and
     print a "Root required?" help message. In most cases, the test just
     needs to be run with elevated permissions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
  Documentation: KVM: Use unordered list for pre-init VGIC registers
  RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
  RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
  RISC-V: perf/kvm: Add reporting of interrupt events
  RISC-V: KVM: Enable ring-based dirty memory tracking
  RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
  RISC-V: KVM: Delegate illegal instruction fault to VS mode
  RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
  RISC-V: KVM: Factor-out g-stage page table management
  RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
  RISC-V: KVM: Introduce struct kvm_gstage_mapping
  RISC-V: KVM: Factor-out MMU related declarations into separate headers
  RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
  RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
  RISC-V: KVM: Don't flush TLB when PTE is unchanged
  RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
  RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
  RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
  RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
  KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
  ...
2025-07-30 17:14:01 -07:00
Linus Torvalds
53564f4005 Merge tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Will Deacon:
 "Core:
   - Remove the 'pgsize_bitmap' member from 'struct iommu_ops'
   - Convert the x86 drivers over to msi_create_parent_irq_domain()

  AMD-Vi:
   - Add support for examining driver/device internals via debugfs
   - Add support for "HATDis" to disable host translation when it is not
     supported
   - Add support for limiting the maximum host translation level based
     on EFR[HATS]

  Apple DART:
   - Don't enable as built-in by default when ARCH_APPLE is selected

  Arm SMMU:
   - Devicetree bindings update for the Qualcomm SMMU in the "Milos" SoC
   - Support for Qualcomm SM6115 MDSS parts
   - Disable PRR on Qualcomm SM8250 as using these bits causes the
     hypervisor to explode

  Intel VT-d:
   - Reorganize Intel VT-d to be ready for iommupt
   - Optimize iotlb_sync_map for non-caching/non-RWBF modes
   - Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()

  Mediatek:
   - Fix build warnings when W=1

  Samsung Exynos:
   - Add support for reserved memory regions specified by the bootloader

  TI OMAP:
   - Use syscon_regmap_lookup_by_phandle_args() instead of parsing the
     node manually

  Misc:
   - Cleanups and minor fixes across the board"

* tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (48 commits)
  iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
  iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
  dt-bindings: arm-smmu: Remove sdm845-cheza specific entry
  iommu/amd: Fix geometry.aperture_end for V2 tables
  iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
  iommu/amd: Add documentation for AMD IOMMU debugfs support
  iommu/amd: Add debugfs support to dump IRT Table
  iommu/amd: Add debugfs support to dump device table
  iommu/amd: Add support for device id user input
  iommu/amd: Add debugfs support to dump IOMMU command buffer
  iommu/amd: Add debugfs support to dump IOMMU Capability registers
  iommu/amd: Add debugfs support to dump IOMMU MMIO registers
  iommu/amd: Refactor AMD IOMMU debugfs initial setup
  dt-bindings: arm-smmu: document the support on Milos
  iommu/exynos: add support for reserved regions
  iommu/arm-smmu: disable PRR on SM8250
  iommu/arm-smmu-v3: Revert vmaster in the error path
  iommu/io-pgtable-arm: Remove unused macro iopte_prot
  iommu/arm-smmu-qcom: Add SM6115 MDSS compatible
  iommu/qcom: Fix pgsize_bitmap
  ...
2025-07-30 10:42:00 -07:00
Paolo Bonzini
f02b1bcc73 Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM IRQ changes for 6.17

 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large numbers
   of VMs.  Such use cases typically don't actually use irqbypass, but
   eliminating the pointless registration is a future problem to solve as it
   likely requires new uAPI.

 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.

 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
   and PIT emulation at compile time.

 - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.

 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.

 - Inhibited AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.

 - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
   clear in the vCPU's physical ID table entry.

 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.

 - Dedup x86's device posted IRQ code, as the vast majority of functionality
   can be shared verbatime between SVM and VMX.

 - Harden the device posted IRQ code against bugs and runtime errors.

 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).

 - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
   only if KVM needs a notification in order to wake the vCPU.

 - Decouple device posted IRQs from VFIO device assignment, as binding a VM to
   a VFIO group is not a requirement for enabling device posted IRQs.

 - Clean up and document/comment the irqfd assignment code.

 - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
   ensure an eventfd is bound to at most one irqfd through the entire host,
   and add a selftest to verify eventfd:irqfd bindings are globally unique.
2025-07-29 08:35:46 -04:00
Will Deacon
9f341a2aeb Merge branch 'intel/vt-d' into next
* intel/vt-d:
  iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
  iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
  iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range
  iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all
  iommu/vt-d: Split paging_domain_compatible()
  iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
  iommu/vt-d: Create unique domain ops for each stage
  iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
  iommu/vt-d: Do not wipe out the page table NID when devices detach
  iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
  iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
  iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook
2025-07-24 11:17:52 +01:00
Lu Baolu
f0b9d31c6e iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
Commit 17fce9d233 ("iommu/vt-d: Put iopf enablement in domain attach
path") disables IOPF on device by removing the device from its IOMMU's
IOPF queue when the last IOPF-capable domain is detached from the device.
Unfortunately, it did this in a wrong place where there are still pending
IOPFs. As a result, a use-after-free error is potentially triggered and
eventually a kernel panic with a kernel trace similar to the following:

 refcount_t: underflow; use-after-free.
 WARNING: CPU: 3 PID: 313 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0
 Workqueue: iopf_queue/dmar0-iopfq iommu_sva_handle_iopf
 Call Trace:
   <TASK>
   iopf_free_group+0xe/0x20
   process_one_work+0x197/0x3d0
   worker_thread+0x23a/0x350
   ? rescuer_thread+0x4a0/0x4a0
   kthread+0xf8/0x230
   ? finish_task_switch.isra.0+0x81/0x260
   ? kthreads_online_cpu+0x110/0x110
   ? kthreads_online_cpu+0x110/0x110
   ret_from_fork+0x13b/0x170
   ? kthreads_online_cpu+0x110/0x110
   ret_from_fork_asm+0x11/0x20
   </TASK>
  ---[ end trace 0000000000000000 ]---

The intel_pasid_tear_down_entry() function is responsible for blocking
hardware from generating new page faults and flushing all in-flight
ones. Therefore, moving iopf_for_domain_remove() after this function
should resolve this.

Fixes: 17fce9d233 ("iommu/vt-d: Put iopf enablement in domain attach path")
Reported-by: Ethan Milon <ethan.milon@eviden.com>
Closes: https://lore.kernel.org/r/e8b37f3e-8539-40d4-8993-43a1f3ffe5aa@eviden.com
Suggested-by: Ethan Milon <ethan.milon@eviden.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250723072045.1853328-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-23 13:01:56 +01:00
Lu Baolu
cee686775f iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
Commit 12724ce3fe ("iommu/vt-d: Optimize iotlb_sync_map for
non-caching/non-RWBF modes") dynamically set iotlb_sync_map. This causes
synchronization issues due to lack of locking on map and attach paths,
racing iommufd userspace operations.

Invalidation changes must precede device attachment to ensure all flushes
complete before hardware walks page tables, preventing coherence issues.

Make domain->iotlb_sync_map static, set once during domain allocation. If
an IOMMU requires iotlb_sync_map but the domain lacks it, attach is
rejected. This won't reduce domain sharing: RWBF and shadowing page table
caching are legacy uses with legacy hardware. Mixed configs (some IOMMUs
in caching mode, others not) are unlikely in real-world scenarios.

Fixes: 12724ce3fe ("iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250721051657.1695788-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-21 14:25:46 +01:00
Mario Limonciello
75952c4975 iommu/vt-d: Use pci_is_display()
The inline pci_is_display() helper does the same thing.  Use it.

Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Daniel Dadap <ddadap@nvidia.com>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patch.msgid.link/20250717173812.3633478-5-superm1@kernel.org
2025-07-17 15:30:13 -05:00
Ethan Milon
e934464e09 iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range
The logic in cache_tag_flush_all() to iterate over cache tags and issue
TLB invalidations is largely duplicated in cache_tag_flush_range(), with
the only difference being the range parameters.

Extend cache_tag_flush_range() to handle a full address space flush when
called with start = 0 and end = ULONG_MAX. This allows
cache_tag_flush_all() to simply delegate to cache_tag_flush_range()

Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-2-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-12-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:04 +01:00
Ethan Milon
3141153816 iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all
The function cache_tag_flush_all() was originally implemented with
incorrect device TLB invalidation logic that does not handle PASID, in
commit c4d27ffaa8 ("iommu/vt-d: Add cache tag invalidation helpers")

This causes regressions where full address space TLB invalidations occur
with a PASID attached, such as during transparent hugepage unmapping in
SVA configurations or when calling iommu_flush_iotlb_all(). In these
cases, the device receives a TLB invalidation that lacks PASID.

This incorrect logic was later extracted into
cache_tag_flush_devtlb_all(), in commit 3297d047cd ("iommu/vt-d:
Refactor IOTLB and Dev-IOTLB flush for batching")

The fix replaces the call to cache_tag_flush_devtlb_all() with
cache_tag_flush_devtlb_psi(), which properly handles PASID.

Fixes: 4f609dbff5 ("iommu/vt-d: Use cache helpers in arch_invalidate_secondary_tlbs")
Fixes: 4e589a5368 ("iommu/vt-d: Use cache_tag_flush_all() in flush_iotlb_all")
Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-1-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-11-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:04 +01:00
Jason Gunthorpe
85cfaacc99 iommu/vt-d: Split paging_domain_compatible()
Make First/Second stage specific functions that follow the same pattern in
intel_iommu_domain_alloc_first/second_stage() for computing
EOPNOTSUPP. This makes the code easier to understand as if we couldn't
create a domain with the parameters for this IOMMU instance then we
certainly are not compatible with it.

Check superpage support directly against the per-stage cap bits and the
pgsize_bitmap.

Add a note that the force_snooping is read without locking. The locking
needs to cover the compatible check and the add of the device to the list.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-10-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:04 +01:00
Jason Gunthorpe
0fa6f08934 iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
First Stage and Second Stage have very different ways to deny
no-snoop. The first stage uses the PGSNP bit which is global per-PASID so
enabling requires loading new PASID entries for all the attached devices.

Second stage uses a bit per PTE, so enabling just requires telling future
maps to set the bit.

Since we now have two domain ops we can have two functions that can
directly code their required actions instead of a bunch of logic dancing
around use_first_level.

Combine domain_set_force_snooping() into the new functions since they are
the only caller.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-9-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:04 +01:00
Jason Gunthorpe
b33125296b iommu/vt-d: Create unique domain ops for each stage
Use the domain ops pointer to tell what kind of domain it is instead of
the internal use_first_level indication. This also protects against
wrongly using a SVA/nested/IDENTITY/BLOCKED domain type in places they
should not be.

The only remaining uses of use_first_level outside the paging domain are in
paging_domain_compatible() and intel_iommu_enforce_cache_coherency().

Thus, remove the useless sets of use_first_level in
intel_svm_domain_alloc() and intel_iommu_domain_alloc_nested(). None of
the unique ops for these domain types ever reference it on their call
chains.

Add a WARN_ON() check in domain_context_mapping_one() as it only works
with second stage.

This is preparation for iommupt which will have different ops for each of
the stages.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-8-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:03 +01:00
Jason Gunthorpe
b9434ba97c iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
Create stage specific functions that check the stage specific conditions
if each stage can be supported.

Have intel_iommu_domain_alloc_paging_flags() call both stages in sequence
until one does not return EOPNOTSUPP and prefer to use the first stage if
available and suitable for the requested flags.

Move second stage only operations like nested_parent and dirty_tracking
into the second stage function for clarity.

Move initialization of the iommu_domain members into paging_domain_alloc().

Drop initialization of domain->owner as the callers all do it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-7-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:03 +01:00
Jason Gunthorpe
5c3687d578 iommu/vt-d: Do not wipe out the page table NID when devices detach
The NID is used to control which NUMA node memory for the page table is
allocated it from. It should be a permanent property of the page table
when it was allocated and not change during attach/detach of devices.

Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Fixes: 7c204426b8 ("iommu/vt-d: Add domain_alloc_paging support")
Link: https://lore.kernel.org/r/20250714045028.958850-6-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:03 +01:00
Jason Gunthorpe
00939bebe5 iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
It has only one caller, no need for two functions.

Correct the WARN_ON() error handling to leak the entire page table if the
HW is still referencing it so we don't UAF during WARN_ON recovery.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-5-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:03 +01:00
Jason Gunthorpe
cd0d0e4e48 iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
Pass the phys_addr_t down through the call chain from the top instead of
passing a pgd_t * KVA. This moves the __pa() into
domain_setup_first_level() which is the first function to obtain the pgd
from the IOMMU page table in this call chain.

The SVA flow is also adjusted to get the pa of the mm->pgd.

iommput will move the __pa() into iommupt code, it never shares the KVA of
the page table with the driver.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-4-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:03 +01:00
Lu Baolu
12724ce3fe iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
The iotlb_sync_map iommu ops allows drivers to perform necessary cache
flushes when new mappings are established. For the Intel iommu driver,
this callback specifically serves two purposes:

- To flush caches when a second-stage page table is attached to a device
  whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
  resident remapping structures are visible to hardware (CAP_REG.RWBF==1).

However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.

Despite being a no-op, cache_tag_flush_range_np() involves iterating
through all cache tags of the iommu's attached to the domain, protected
by a spinlock. This unnecessary execution path introduces overhead,
leading to a measurable I/O performance regression. On systems with NVMes
under the same bridge, performance was observed to drop from approximately
~6150 MiB/s down to ~4985 MiB/s.

Introduce a flag in the dmar_domain structure. This flag will only be set
when iotlb_sync_map is required (i.e., when CM or RWBF is set). The
cache_tag_flush_range_np() is called only for domains where this flag is
set. This flag, once set, is immutable, given that there won't be mixed
configurations in real-world scenarios where some IOMMUs in a system
operate in caching mode while others do not. Theoretically, the
immutability of this flag does not impact functionality.

Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738
Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com
Fixes: 129dab6e12 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250703031545.3378602-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250714045028.958850-3-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:03 +01:00
Vineeth Pillai (Google)
bd26cd9d81 iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook
iommu init hook is wrapped in CONFI_X86 and is a remnant of dmar.c when
it was a common code in "drivers/pci/dmar.c". This was added in commit
(9d5ce73a64 x86: intel-iommu: Convert detect_intel_iommu to use
iommu_init hook)

Now this is built only for x86. This config wrap could be removed.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250616131740.3499289-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-2-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-14 11:18:03 +01:00
Nicolin Chen
62622a8753 iommu: Allow an input type in hw_info op
The hw_info uAPI will support a bidirectional data_type field that can be
used as an input field for user space to request for a specific info data.

To prepare for the uAPI update, change the iommu layer first:
 - Add a new IOMMU_HW_INFO_TYPE_DEFAULT as an input, for which driver can
   output its only (or firstly) supported type
 - Update the kdoc accordingly
 - Roll out the type validation in the existing drivers

Link: https://patch.msgid.link/r/00f4a2d3d930721f61367014717b3ba2d1e82a81.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-07-11 14:34:35 -03:00
Nicolin Chen
4b57c057f9 iommu: Use enum iommu_hw_info_type for type in hw_info op
Replace u32 to make it clear. No functional changes.

Also simplify the kdoc since the type itself is clear enough.

Link: https://patch.msgid.link/r/651c50dee8ab900f691202ef0204cd5a43fdd6a2.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-07-10 12:38:50 -03:00
Lu Baolu
25b1b75bba iommu/vt-d: Assign devtlb cache tag on ATS enablement
Commit <4f1492efb495> ("iommu/vt-d: Revert ATS timing change to fix boot
failure") placed the enabling of ATS in the probe_finalize callback. This
occurs after the default domain attachment, which is when the ATS cache
tag is assigned. Consequently, the device TLB cache tag is missed when the
domain is attached, leading to the device TLB not being invalidated in the
iommu_unmap paths.

Fix this by assigning the CACHE_TAG_DEVTLB cache tag when ATS is enabled.

Fixes: 4f1492efb4 ("iommu/vt-d: Revert ATS timing change to fix boot failure")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250625050135.3129955-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250628100351.3198955-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-07-04 10:33:56 +02:00
Marc Zyngier
8154f3c0fd iommu/intel: Convert to msi_create_parent_irq_domain() helper
Now that we have a concise helper to create an MSI parent domain,
switch the Intel IOMMU remapping over to that.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nam Cao <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241204124549.607054-10-maz@kernel.org
Link: https://lore.kernel.org/r/169c793c50be8493cfd9d11affb00e9ed6341c36.1750858125.git.namcao@linutronix.de
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-07-04 10:28:11 +02:00
Jason Gunthorpe
8901812485 iommu: Remove ops.pgsize_bitmap from drivers that don't use it
These drivers all set the domain->pgsize_bitmap in their
domain_alloc_paging() functions, so the ops value is never used. Delete
it.

Reviewed-by: Sven Peter <sven@svenpeter.dev> # for Apple DART
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> # for RISC-V
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-68a2e1ba507c+1fb-iommu_rm_ops_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-06-27 08:55:56 +02:00
Sean Christopherson
53527ea1b7 iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures.  Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:31 -07:00
Joerg Roedel
879b141b7c Merge branches 'fixes', 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2025-05-23 17:14:32 +02:00
Lu Baolu
320302baed iommu/vt-d: Restore context entry setup order for aliased devices
Commit 2031c469f8 ("iommu/vt-d: Add support for static identity domain")
changed the context entry setup during domain attachment from a
set-and-check policy to a clear-and-reset approach. This inadvertently
introduced a regression affecting PCI aliased devices behind PCIe-to-PCI
bridges.

Specifically, keyboard and touchpad stopped working on several Apple
Macbooks with below messages:

 kernel: platform pxa2xx-spi.3: Adding to iommu group 20
 kernel: input: Apple SPI Keyboard as
 /devices/pci0000:00/0000:00:1e.3/pxa2xx-spi.3/spi_master/spi2/spi-APP000D:00/input/input0
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
 0xffffa000 [fault reason 0x06] PTE Read access is not set
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
 0xffffa000 [fault reason 0x06] PTE Read access is not set
 kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
 0xffffa000 [fault reason 0x06] PTE Read access is not set
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00

Fix this by restoring the previous context setup order.

Fixes: 2031c469f8 ("iommu/vt-d: Add support for static identity domain")
Closes: https://lore.kernel.org/all/4dada48a-c5dd-4c30-9c85-5b03b0aa01f0@bfh.ch/
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250514060523.2862195-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250520075849.755012-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-05-23 17:11:40 +02:00