Commit Graph

118544 Commits

Author SHA1 Message Date
Matt Roper
4f3ecdb6ea drm/xe: Bypass Wa_14018094691 when primary GT is disabled
Don't try to lookup Wa_14018094691 on a NULL GT when the primary GT is
disabled.  Since this whole workaround centers around mid-thread
preemption behavior, the workaround isn't relevant if the primary GT
(where the engines that can do MTP live) is disabled.

Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-42-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:45:17 -07:00
Matt Roper
794e735cb6 drm/xe/rtp: Pass xe_device parameter to FUNC matches
FUNC matches in RTP only pass the GT and hwe, preventing them from being
used effectively in device workarounds.  Add an additional xe_device
parameter so that we can use them in device workarounds where a GT may
not be available.

Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-41-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:45:17 -07:00
Matt Roper
78de8f8766 drm/xe: Handle Wa_22010954014 and Wa_14022085890 as device workarounds
When Wa_22010954014 and Wa_14022085890 were first implemented, we didn't
have a device workaround infrastructure so we hacked them into the GT
workaround list.  Now that we have proper device workaround support,
move them to the proper place.  Note that Wa_14022085890 specifically
applies to BMG-G21 platforms, so this requires defining a BMG
subplatform to capture the correct subset of device IDs.

Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-40-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:45:17 -07:00
Matt Roper
4d29240682 drm/xe/irq: Don't try to lookup engine masks for non-existent primary GT
If the primary GT is disabled via configfs, we shouldn't try to access
it to lookup BCS/CCS engine masks.  For the purposes of IRQ reset (which
masks & disables interrupts in an sgunit register), assume all possible
instances are present.

Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-39-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:45:17 -07:00
Matt Roper
886e5b6e5c drm/xe: Make display part of Wa_22019338487 a device workaround
The display part of Wa_22019338487 (i.e., avoiding use of stolen memory)
is using a platform test rather than an graphics/media IP test.  Since
this workaround is focused on non-GT uses of stolen memory, it makes
sense that we'd want to still apply the workaround on affected platforms
even if the GTs themselves are disabled via configfs.

Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-38-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:45:16 -07:00
Matt Roper
d0ff153cca drm/xe: Check for primary GT before looking up Wa_22019338487
If the primary GT is disabled via configfs, we need to make sure that we
don't search for this workaround on a NULL xe_gt pointer.  Since we can
disable the primary GT only on igpu platforms, the media GT is the one
we'd want to check anyway for this workaround.

The ternary operators in ggtt_update_access_counter() were getting a bit
long/complicated, so rewrite them with regular if/else statements.
While we're at it, throw in a couple extra assertions to make sure that
we're truly picking the expected GT according to igpu/dgpu type.

v2:
 - Adjust indentation/wrapping; it's easier to read this with longer,
   unwrapped lines.  (Lucas)
 - Tweak wording of commit message to remove ambiguity.  (Gustavo)

Cc: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-37-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:58 -07:00
Matt Roper
999ef874c1 drm/xe/pmu: Initialize PMU event types based on first available GT
GT ID#0 (primary GT on tile 0) may not always be available if the
primary GT has been disabled via configfs.  Instead use the first
available GT when determining which PMU events are supported.  If there
are no GTs, then don't advertise any GT-related events.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-36-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:58 -07:00
Matt Roper
090e7fc422 drm/xe/query: Report hwconfig size as 0 if primary GT is disabled
The hwconfig table is part of the primary GT's GuC firmware.  If the
primary GT is disabled, the hwconfig is unavailable and should be
reported to userspace as having size 0.

Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-35-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:58 -07:00
Matt Roper
082547d8b4 drm/xe: Skip L2 / TDF cache flushes if primary GT is disabled
If the primary GT is disabled via configfs, GT-side L2 and TD cache
flushes are unnecessary since nothing is using/filling these caches.

Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-34-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
9c52402f6b drm/xe: Move primary GT allocation from xe_tile_init_early to xe_tile_init
During the early days of the Xe driver, there were cases where we
accessed some fields in the primary GT's xe_gt structure before the GT
itself was formally initialized; this required that the structure itself
be allocated during xe_tile_init_early().  A lot of refactoring of the
device probe has happened since that time and there's no longer a need
to allocate the primary GT early.  Move the allocation into
xe_info_init() where GT initialization happens and where we're doing the
allocation of the media GT.

v2:
 - Only make this change after a separate patch to perform VF GMD_ID
   lookup with a dummy GT instead of xe_root_mmio_gt().

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-33-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
ff1d2b5e3d drm/xe: Read VF GMD_ID with a specifically-allocated dummy GT
SRIOV VF initialization has a bit of a chicken and egg design problem.
Determining the IP version of the graphics and media IPs can't be done
via direct register reads as it is on PF or native and instead requires
querying the GuC.  However initialization of the GT, including its GuC,
needs to wait until after we know the IP versions so that the proper
initialization steps for the platform/IP are followed.

Currently the (somewhat hacky) solution is to manually fill out just
enough fields in tile 0's primary GT structure to make it look as if the
GT has been initialized so that the GuC can be partially initialized and
queried to obtain the GMD_ID values.  When the GT gets properly
initialized during the regular flows, the hacked-up values will get
overwritten as part of the general initialization flows.

Rather than using tile 0's primary GT structure to hold the hacked up
values for querying every GT on every tile, instead allocate a dedicated
dummy structure.  This will allow us to move the tile->primary_gt's
allocation to a more consistent place later in the initialization flow
in future patches (i.e., we shouldn't even allocate this GT structure if
the GT is disabled/unavailable).  It also helps ensure there can't be
any accidental leakage of initialization or state between the dummy
initialization for GMD_ID and the real driver initialization of the GT.

v2:
 - Initialize gt->tile for temporary GT.  (CI, Michal)
 - Use scope-based cleanup handler to free temp GT.  (Michal)
 - Propagate actual error code from xe_gt_sriov_vf_bootstrap() rather
   than just setting IP version to 0.0 now that read_gmdid() can return
   an error.  (Michal)
v3:
 - Explicitly initialize gt to NULL, just in case something else gets
   inserted before the kzalloc() in the future.  (Lucas)

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-32-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
1a28651c06 drm/xe: Move 'has_flatccs' flag back to platform descriptor
FlatCCS presence/absence is a flag that should be tracked at the
platform level rather than the IP level.  FlatCCS affects the
device-wide memory initialization and reservations so its effects are
not confined to a single IP block or GT.  This is also a trait that
should be tied to the platform even if the graphics IP itself is not
present (e.g., if we disable the primary GT via configfs).

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-31-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
76b7aedd66 drm/xe: Move 'vram_flags' flag back to platform descriptor
Restrictions and requirements on VRAM alignment are something that
should be tracked at the platform level rather than the IP level.  Even
when mixing and matching various graphics, media, and display IP blocks,
the platform as a whole has to have consistent memory allocation
handling.  This is also a trait that should be tied to the platform even
if the graphics IP itself is not present (e.g., if we disable the
primary GT via configfs).

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-30-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
50292f9af8 drm/xe: Move 'vm_max_level' flag back to platform descriptor
The number of page table levels for PPGTT virtual addresses is something
that should be tracked at the platform level rather than the IP level.
Even when mixing and matching various graphics, media, and display IP
blocks, the platform as a whole has to have consistent page table
handling.  This is also a trait that should be tied to the platform even
if the graphics IP itself is not present (e.g., if we disable the
primary GT via configfs).

v2:
 - Drop default value of 4 and explicitly set the value in each platform
   descriptor.  (Lucas)
v3:
 - Drop outdated code comment and commit message paragraph about default
   value.  (Gustavo)
v4:
 - Add missing setting for tgl_desc.  (Gustavo)

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-29-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
a3bcaf11f4 drm/xe: Move 'va_bits' flag back to platform descriptor
The number of virtual address bits is something that should be tracked
at the platform level rather than the IP level.  Even when mixing and
matching various graphics, media, and display IP blocks, the platform as
a whole has to have consistent page table handling.  This is also a
trait that should be tied to the platform even if the graphics IP itself
is not present (e.g., if we disable the primary GT via configfs).

v2:
 - Drop the default value of 48 and explicitly set it in each relevant
   descriptor.  (Lucas, Michal)
v3:
 - Drop an outdated comment about default value.  (Michal)

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-28-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
d41f306cc2 drm/xe: Drop GT parameter to xe_display_irq_postinstall()
Display interrupt handling has no relation to GT(s) on the platforms
supported by the Xe driver.  We only call xe_display_irq_postinstall
with the first tile's primary GT, so the single condition that uses the
GT pointer within the function always evaluates to true.  Drop the
unnecessary parameter and the condition.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-27-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
2816905e14 drm/xe/huc: Adjust HuC check on primary GT
The HuC initialization code determines whether a platform can have a HuC
on the primary GT by checking whether tile->media_gt is NULL; old Xe1
platforms that combined render+media into a single GT will always have a
NULL media_gt pointer.  However once we allow media to be disabled via
configfs, there will also be cases where tile->media_gt is NULL on more
modern platforms, causing this condition to behave incorrectly.

To handle cases where media gets disabled via configfs (or theoretical
cases where media is truly fused off in hardware), change the condition
to consider the graphics version of the primary GT; only the old Xe1
platforms with graphics versions 12.55 or earlier should try to
initialize a HuC on the primary GT.

Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20251013200944.2499947-26-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:44:57 -07:00
Matt Roper
89e347f8a7 drm/xe/kunit: Fix kerneldoc for parameterized tests
Kunit's generate_params() was recently updated to take an additional
test context parameter.  Xe's IP and platform parameter generators were
updated accordingly at the same time, but the new parameter was not
added to the functions' kerneldoc, resulting in the following warnings:

   Warning: drivers/gpu/drm/xe/tests/xe_pci.c:78 function parameter 'test' not described in 'xe_pci_fake_data_gen_params'
   Warning: drivers/gpu/drm/xe/tests/xe_pci.c:254 function parameter 'test' not described in 'xe_pci_graphics_ip_gen_param'
   Warning: drivers/gpu/drm/xe/tests/xe_pci.c:278 function parameter 'test' not described in 'xe_pci_media_ip_gen_param'
   Warning: drivers/gpu/drm/xe/tests/xe_pci.c:302 function parameter 'test' not described in 'xe_pci_id_gen_param'
   Warning: drivers/gpu/drm/xe/tests/xe_pci.c:390 function parameter 'test' not described in 'xe_pci_live_device_gen_param'
   5 warnings as errors

Document the new parameter to eliminate the warnings and make CI happy.

Fixes: b9a214b5f6 ("kunit: Pass parameterized test context to generate_params()")
Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20251013153014.2362879-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-10-14 07:37:26 -07:00
Thomas Hellström
2cfcea7a74 drm/xe/svm: Ensure data will be migrated to system if indicated by madvise.
If the location madvise() is set to
DRM_XE_PREFERRED_LOC_DEFAULT_SYSTEM, the drm_pagemap in the
SVM gpu fault handler will be set to NULL. However there is nothing
that explicitly migrates the data to system if it is already present
in device memory.

In that case, set the device memory owner to NULL to ensure
data gets properly migrated to system on page-fault.

v2:
- Remove redundant dpagemap assignment (Himal Prasad Ghimiray)

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com> #v1
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20251010104149.72783-2-thomas.hellstrom@linux.intel.com
Fixes: 10aa5c8060 ("drm/gpusvm, drm/xe: Fix userptr to not allow device private pages")
2025-10-14 13:11:16 +02:00
Tomasz Lis
bb3d208250 drm/xe/ct: Separate waiting for retry from ct send function
The function `guc_ct_send_locked()` is really quite simple, but still
looks complex due to exposed internals. It is sending a message,
and in case of lack of space, waiting for a proper moment to send
a retry.

Clear separation of send function and wait function will help with
readability. This is a cosmetic change only, no functional difference
is expected.

This patch introduces `guc_ct_send_wait_for_retry()`, and uses it to
greatly simplify `guc_ct_send_locked()`.

Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251009170844.178199-1-tomasz.lis@intel.com
2025-10-14 12:37:02 +02:00
Thomas Hellström
82ee50252d Merge drm/drm-next into drm-xe-next
Backmerging to bring in 6.18-rc1.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-10-14 11:31:49 +02:00
Matthew Brost
dd83b101a4 drm/xe: Enable 2M pages in xe_migrate_vram
Using 2M pages in xe_migrate_vram has two benefits: we issue fewer
instructions per 2M copy (1 vs. 512), and the cache hit rate should be
higher. This results in increased copy engine bandwidth, as shown by
benchmark IGTs.

Enable 2M pages by reserving PDEs in the migrate VM and using 2M pages
in xe_migrate_vram if the DMA address order matches 2M.

v2:
 - Reuse build_pt_update_batch_sram (Stuart)
 - Fix build_pt_update_batch_sram for PAGE_SIZE > 4K
v3:
 - More fixes for PAGE_SIZE > 4K, align chunk, decrement chunk as needed
 - Use stack incr var in xe_migrate_vram_use_pde (Stuart)
v4:
 - Split PAGE_SIZE > 4K fix out in different patch (Stuart)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20251013034555.4121168-3-matthew.brost@intel.com
2025-10-13 12:31:22 -07:00
Matthew Brost
55991d854f drm/xe: Fix build_pt_update_batch_sram for non-4K PAGE_SIZE
The build_pt_update_batch_sram function in the Xe migrate layer assumes
PAGE_SIZE == XE_PAGE_SIZE (4K), which is not a valid assumption on
non-x86 platforms. This patch updates build_pt_update_batch_sram to
correctly handle PAGE_SIZE > 4K by programming multiple 4K GPU pages per
CPU page.

v5:
 - Mask off non-address bits during compare

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Tested-by: Simon Richter <Simon.Richter@hogyros.de>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20251013034555.4121168-2-matthew.brost@intel.com
2025-10-13 12:31:21 -07:00
Shuicheng Lin
604be9dad8 drm/xe: Fix comments in xe_gt struct
Correct several spelling and grammar issues in xe_gt struct
documentation to improve readability:

- Fix "to not" -> "do not".
- Fix "mmigrations" -> "migrations".
- Fix "Multiple queues exists" -> "Multiple queues exist".
- Fix "are be processed" -> "to be processed".
- Fix "have being processed" -> "have been processed".

These changes are purely cosmetic and do not affect functionality.

v2: drop kernel-doc formatting change. (Jani)

Cc: Jani Nikula <jani.nikula@linux.intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251006151317.2553182-2-shuicheng.lin@intel.com
2025-10-13 12:28:27 -07:00
Shuicheng Lin
9b42321a02 drm/xe/guc: Check GuC running state before deregistering exec queue
In normal operation, a registered exec queue is disabled and
deregistered through the GuC, and freed only after the GuC confirms
completion. However, if the driver is forced to unbind while the exec
queue is still running, the user may call exec_destroy() after the GuC
has already been stopped and CT communication disabled.

In this case, the driver cannot receive a response from the GuC,
preventing proper cleanup of exec queue resources. Fix this by directly
releasing the resources when GuC is not running.

Here is the failure dmesg log:
"
[  468.089581] ---[ end trace 0000000000000000 ]---
[  468.089608] pci 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535)
[  468.090558] pci 0000:03:00.0: [drm] GT0:     total 65535
[  468.090562] pci 0000:03:00.0: [drm] GT0:     used 1
[  468.090564] pci 0000:03:00.0: [drm] GT0:     range 1..1 (1)
[  468.092716] ------------[ cut here ]------------
[  468.092719] WARNING: CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298 ttm_vram_mgr_fini+0xf8/0x130 [xe]
"

v2: use xe_uc_fw_is_running() instead of xe_guc_ct_enabled().
    As CT may go down and come back during VF migration.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251010172529.2967639-2-shuicheng.lin@intel.com
2025-10-13 10:38:47 -07:00
Raag Jadav
0bb78ce099 drm/xe/i2c: Wire up reset/postinstall for I2C IRQ
I2C IRQ needs to be routed to SGUnit or PUnit for the devices that support
it. Wire up reset/postinstall handles for I2C IRQ to take care of this.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20251011123509.3233213-3-raag.jadav@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-10-13 07:24:25 -07:00
Raag Jadav
3df5aacb9d drm/xe/i2c: Introduce xe_i2c_irq_present()
In preparation of wider usecases which require checking for I2C IRQ
presence, introduce xe_i2c_irq_present() helper.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20251011123509.3233213-2-raag.jadav@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-10-13 07:24:25 -07:00
Arun Abhishek Chowhan
64d00d41f5 drm/xe: Sort include files alphabetically.
Sort the include lines alphabetically, no impact on code behavior.

Signed-off-by: Arun Abhishek Chowhan <arun.abhishek.chowhan@intel.com>
Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
Link: https://lore.kernel.org/r/20251013095914.3742505-1-arun.abhishek.chowhan@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-10-13 07:18:20 -07:00
Linus Torvalds
284fc30e66 Merge tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel
Pull more drm fixes from Dave Airlie:
 "Just the follow up fixes for rc1 from the next branch, amdgpu and xe
  mostly with a single v3d fix in there.

  amdgpu:
   - DC DCE6 fixes
   - GPU reset fixes
   - Secure diplay messaging cleanup
   - MES fix
   - GPUVM locking fixes
   - PMFW messaging cleanup
   - PCI US/DS switch handling fix
   - VCN queue reset fix
   - DC FPU handling fix
   - DCN 3.5 fix
   - DC mirroring fix

  amdkfd:
   - Fix kfd process ref leak
   - mmap write lock handling fix
   - Fix comments in IOCTL

  xe:
   - Fix build with clang 16
   - Fix handling of invalid configfs syntax usage and spell out the
     expected syntax in the documentation
   - Do not try late bind firmware when running as VF since it shouldn't
     handle firmware loading
   - Fix idle assertion for local BOs
   - Fix uninitialized variable for late binding
   - Do not require perfmon_capable to expose free memory at page
     granularity. Handle it like other drm drivers do
   - Fix lock handling on suspend error path
   - Fix I2C controller resume after S3

  v3d:
   - fix fence locking"

* tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel: (34 commits)
  drm/amd/display: Incorrect Mirror Cositing
  drm/amd/display: Enable Dynamic DTBCLK Switch
  drm/amdgpu: Report individual reset error
  drm/amdgpu: partially revert "revert to old status lock handling v3"
  drm/amd/display: Fix unsafe uses of kernel mode FPU
  drm/amd/pm: Disable VCN queue reset on SMU v13.0.6 due to regression
  drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine
  drm/amdgpu: Check swus/ds for switch state save
  drm/amdkfd: Fix two comments in kfd_ioctl.h
  drm/amd/pm: Avoid interface mismatch messaging
  drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init
  drm/amd/amdgpu: Fix the mes version that support inv_tlbs
  drm/amd: Check whether secure display TA loaded successfully
  drm/amdkfd: Fix mmap write lock not release
  drm/amdkfd: Fix kfd process ref leaking when userptr unmapping
  drm/amdgpu: Fix for GPU reset being blocked by KIQ I/O.
  drm/amd/display: Disable scaling on DCE6 for now
  drm/amd/display: Properly disable scaling on DCE6
  drm/amd/display: Properly clear SCL_*_FILTER_CONTROL on DCE6
  drm/amd/display: Add missing DCE6 SCL_HORZ_FILTER_INIT* SRIs
  ...
2025-10-10 14:02:14 -07:00
Linus Torvalds
1e5d41b981 Merge tag 'drm-fixes-2025-10-11' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "Some fixes leftover from our fixes branch, just nouveau and vmwgfx:

  nouveau:
   - Return errno code from TTM move helper

  vmwgfx:
   - Fix null-ptr access in cursor code
   - Fix UAF in validation
   - Use correct iterator in validation"

* tag 'drm-fixes-2025-10-11' of https://gitlab.freedesktop.org/drm/kernel:
  drm/nouveau: fix bad ret code in nouveau_bo_move_prep
  drm/vmwgfx: Fix copy-paste typo in validation
  drm/vmwgfx: Fix Use-after-free in validation
  drm/vmwgfx: Fix a null-ptr access in the cursor snooper
2025-10-10 13:59:38 -07:00
Dave Airlie
5ca5f00a16 Merge tag 'drm-misc-fixes-2025-10-09' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Short summary of fixes pull:

nouveau:
- Return errno code from TTM move helper

vmwgfx:
- Fix null-ptr access in cursor code
- Fix UAF in validation
- Use correct iterator in validation

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20251009120004.GA17570@linux.fritz.box
2025-10-11 06:17:13 +10:00
Shuicheng Lin
65369b8e29 drm/xe: Change return type of detect_bar2_dgfx() from s64 to u64
The function never returns a negative value, and the return value is
assigned to a u64 variable. Use u64 for better type correctness
and clarity.

v2: add assert to catch theoretical negative or zero stolen size. (Rodrigo)

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20251009230239.2830207-8-shuicheng.lin@intel.com
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-10-10 15:10:17 -04:00
Shuicheng Lin
0145a99eac drm/xe: Fix copyright in xe_ttm_stolen_mgr
- Correct Red Hat copyright year from "2002" to "2022".
- Sort the include lines alphabetically.

No functional changes intended.

Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20251009230239.2830207-7-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-10-10 15:10:13 -04:00
Shuicheng Lin
f85d4062bc drm/xe: Fix copyright and function naming in xe_ttm_sys_mgr
- Correct Red Hat copyright year from "2002" to "2022".
- Rename ttm_sys_mgr_fini() to xe_ttm_sys_mgr_fini() to avoid
  confusion with generic TTM helpers.

No functional changes intended.

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20251009230239.2830207-6-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-10-10 15:10:13 -04:00
Vinay Belgaumkar
4cbc08649a drm/xe: Enable media sampler power gating
Where applicable, enable media sampler power gating. Also, add
it to the powergate_info debugfs.

v2: Remove the sampler powergate status since it is cleared quickly anyway.
v3: Use vcs mask (Rodrigo) and fix the version check for media
v4: Remove extra spaces
v5: Media samplers are independent of vcs mask,
    use Media version 1255 (Matt Roper)

Fixes: 38e8c4184e ("drm/xe: Enable Coarse Power Gating")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://lore.kernel.org/r/20251010011047.2047584-1-vinay.belgaumkar@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-10-10 15:07:54 -04:00
Matthew Brost
75188605c5 drm/xe: Handle mixed mappings and existing VRAM on atomic faults
Moving to VRAM will fail if mixed mappings are present or if the page is
already located in VRAM. Atomic faults that require a move to VRAM
currently retry without attempting to evict mixed mappings or locate
existing VRAM mappings.

This patch fixes the issue by attempting to evict mixed mappings or find
existing VRAM pages when a move to VRAM fails during atomic fault
handling.

Fixes: a9ac0fa455 ("drm/xe: Strict migration policy for atomic SVM faults")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20251009130629.3531962-1-matthew.brost@intel.com
2025-10-10 08:18:11 -07:00
Thomas Hellström
381f1ed151 drm/xe/migrate: Fix an error path
The exhaustive eviction accidently changed an error path goto to
a return. Fix this.

Fixes: 59eabff2a3 ("drm/xe: Convert xe_bo_create_pin_map() for exhaustive eviction")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://lore.kernel.org/r/20250910160939.103473-1-thomas.hellstrom@linux.intel.com
2025-10-10 11:35:11 +02:00
Lucas De Marchi
45e33f220f drm/xe: Move rebar to be done earlier
There may be cases in which the BAR0 also needs to move to accommodate
the bigger BAR2. However if it's not released, the BAR2 resize fails.
During the vram probe it can't be released as it's already in use by
xe_mmio for early register access.

Add a new function in xe_vram and let xe_pci call it directly before
even early device probe. This allows the BAR2 to resize in cases BAR0
also needs to move, assuming there aren't other reasons to hold that
move:

	[] xe 0000:03:00.0: vgaarb: deactivate vga console
	[] xe 0000:03:00.0: [drm] Attempting to resize bar from 8192MiB -> 16384MiB
	[] xe 0000:03:00.0: BAR 0 [mem 0x83000000-0x83ffffff 64bit]: releasing
	[] xe 0000:03:00.0: BAR 2 [mem 0x4000000000-0x41ffffffff 64bit pref]: releasing
	[] pcieport 0000:02:01.0: bridge window [mem 0x4000000000-0x41ffffffff 64bit pref]: releasing
	[] pcieport 0000:01:00.0: bridge window [mem 0x4000000000-0x41ffffffff 64bit pref]: releasing
	[] pcieport 0000:01:00.0: bridge window [mem 0x4000000000-0x43ffffffff 64bit pref]: assigned
	[] pcieport 0000:02:01.0: bridge window [mem 0x4000000000-0x43ffffffff 64bit pref]: assigned
	[] xe 0000:03:00.0: BAR 2 [mem 0x4000000000-0x43ffffffff 64bit pref]: assigned
	[] xe 0000:03:00.0: BAR 0 [mem 0x83000000-0x83ffffff 64bit]: assigned
	[] pcieport 0000:00:01.0: PCI bridge to [bus 01-04]
	[] pcieport 0000:00:01.0:   bridge window [mem 0x83000000-0x840fffff]
	[] pcieport 0000:00:01.0:   bridge window [mem 0x4000000000-0x44007fffff 64bit pref]
	[] pcieport 0000:01:00.0: PCI bridge to [bus 02-04]
	[] pcieport 0000:01:00.0:   bridge window [mem 0x83000000-0x840fffff]
	[] pcieport 0000:01:00.0:   bridge window [mem 0x4000000000-0x43ffffffff 64bit pref]
	[] pcieport 0000:02:01.0: PCI bridge to [bus 03]
	[] pcieport 0000:02:01.0:   bridge window [mem 0x83000000-0x83ffffff]
	[] pcieport 0000:02:01.0:   bridge window [mem 0x4000000000-0x43ffffffff 64bit pref]
	[] xe 0000:03:00.0: [drm] BAR2 resized to 16384M
	[] xe 0000:03:00.0: [drm:xe_pci_probe [xe]] BATTLEMAGE  e221:0000 dgfx:1 gfx:Xe2_HPG (20.02) ...

For BMG there are additional fix needed in the PCI side, but this
helps getting it to a working resize.

All the rebar logic is more pci-specific than xe-specific and can be
done very early in the probe sequence. In future it would be good to
move it out of xe_vram.c, but this refactor is left for later.

Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: stable@vger.kernel.org # 6.12+
Link: https://lore.kernel.org/intel-xe/fafda2a3-fc63-ce97-d22b-803f771a4d19@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20250918-xe-pci-rebar-2-v1-2-6c094702a074@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-10-09 22:09:31 -07:00
Dave Airlie
c4b6ddcf01 Merge tag 'amd-drm-next-6.18-2025-10-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.18-2025-10-09:

amdgpu:
- DC DCE6 fixes
- GPU reset fixes
- Secure diplay messaging cleanup
- MES fix
- GPUVM locking fixes
- PMFW messaging cleanup
- PCI US/DS switch handling fix
- VCN queue reset fix
- DC FPU handling fix
- DCN 3.5 fix
- DC mirroring fix

amdkfd:
- Fix kfd process ref leak
- mmap write lock handling fix
- Fix comments in IOCTL

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20251009162915.981503-1-alexander.deucher@amd.com
2025-10-10 06:57:56 +10:00
Matthew Brost
8b9ba8d6d9 drm/xe: Don't allow evicting of BOs in same VM in array of VM binds
An array of VM binds can potentially evict other buffer objects (BOs)
within the same VM under certain conditions, which may lead to NULL
pointer dereferences later in the bind pipeline. To prevent this, clear
the allow_res_evict flag in the xe_bo_validate call.

v2:
 - Invert polarity of no_res_evict (Thomas)
 - Add comment in code explaining issue (Thomas)

Cc: stable@vger.kernel.org
Reported-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6268
Fixes: 774b5fa509 ("drm/xe: Avoid evicting object of the same vm in none fault mode")
Fixes: 77f2ef3f16 ("drm/xe: Lock all gpuva ops during VM bind IOCTL")
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Tested-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20251009110618.3481870-1-matthew.brost@intel.com
2025-10-09 11:40:19 -07:00
Kenneth Graunke
146046907b drm/xe: Increase global invalidation timeout to 1000us
The previous timeout of 500us seems to be too small; panning the map in
the Roll20 VTT in Firefox on a KDE/Wayland desktop reliably triggered
timeouts within a few seconds of usage, causing the monitor to freeze
and the following to be printed to dmesg:

[Jul30 13:44] xe 0000:03:00.0: [drm] *ERROR* GT0: Global invalidation timeout
[Jul30 13:48] xe 0000:03:00.0: [drm] *ERROR* [CRTC:82:pipe A] flip_done timed out

I haven't hit a single timeout since increasing it to 1000us even after
several multi-hour testing sessions.

Fixes: 0dd2dd0182 ("drm/xe: Move DSB l2 flush to a more sensible place")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5710
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Cc: stable@vger.kernel.org
Cc: Maarten Lankhorst <dev@lankhorst.se>
Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20250912223254.147940-1-kenneth@whitecape.org
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-10-09 07:53:40 -07:00
Satyanarayana K V P
60e2667557 drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC
Some VF2GUC actions may take longer to process. Increase default timeout
after received BUSY indication to 2sec to cover all worst case scenarios.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-35-matthew.brost@intel.com
2025-10-09 03:24:28 -07:00
Matthew Brost
8f1e1e524c drm/xe/vf: Rebase CCS save/restore BB GGTT addresses
Rebase the CCS save/restore BB's GGTT addresses during VF post-migration
recovery by setting the software ring tail to zero, the LRC ring head to
zero, and rewriting the jump-to-BB instructions.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-34-matthew.brost@intel.com
2025-10-09 03:24:28 -07:00
Matthew Brost
a093570ecd drm/xe/vf: Ensure media GT VF recovery runs after primary GT on PTL
It is possible that the media GT's VF post-migration recovery work item
gets scheduled before the primary GT's work item. Since the media GT
depends on the primary GT's work item to complete CCS restore, if the
media GT's work item is scheduled first, detect this condition and
re-queue the media GT's work item for a later time.

v5:
 - Adjust debug message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-33-matthew.brost@intel.com
2025-10-09 03:24:28 -07:00
Matthew Brost
efd38d619a drm/xe/vf: Use primary GT ordered work queue on media GT on PTL VF
VF CCS restore is a primary GT operation on which the media GT depends.
Therefore, it doesn't make much sense to run these operations in
parallel. To address this, point the media GT's ordered work queue to
the primary GT's ordered work queue on platforms that require (PTL VFs)
CCS restore as part of VF post-migration recovery.

v7:
 - Remove bool from xe_gt_alloc (Lucas)
v9:
 - Fix typo (Lucas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-32-matthew.brost@intel.com
2025-10-09 03:24:28 -07:00
Satyanarayana K V P
673167d9f0 drm/xe: Use PPGTT addresses for TLB invalidation to avoid GGTT fixups
The migrate VM builds the CCS metadata save/restore batch buffer (BB) in
advance and retains it so the GuC can submit it directly when saving a
VM’s state.

When a VM migrates between VFs, the GGTT base can change. Any GGTT-based
addresses embedded in the BB would then have to be parsed and patched.

Use PPGTT addresses in the BB (including for TLB invalidation) so the BB
remains GGTT-agnostic and requires no address fixups during migration.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-31-matthew.brost@intel.com
2025-10-09 03:24:07 -07:00
Matthew Brost
3b56911960 drm/xe/vf: Workaround for race condition in GuC firmware during VF pause
A race condition exists where a paused VF's H2G request can be processed
and subsequently rejected. This rejection results in a FAST_REQ failure
being delivered to the KMD, which then terminates the CT via a dead
worker and triggers a GT reset—an undesirable outcome.

This workaround mitigates the issue by checking if a VF post-migration
recovery is in progress and aborting these adverse actions accordingly.
The GuC firmware will address this bug in an upcoming release. Once that
version is available and VF migration depends on it, this workaround can
be safely removed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-30-matthew.brost@intel.com
2025-10-09 03:22:57 -07:00
Matthew Brost
1521fad9ad drm/xe/vf: Add debug prints for GuC replaying state during VF recovery
Helpful to manually verify the GuC state machine can correctly replay
the state during a VF post-migration recovery. All replay paths have
been manually verified as triggered and working during testing.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-29-matthew.brost@intel.com
2025-10-09 03:22:56 -07:00
Matthew Brost
3c1fa4aa60 drm/xe: Move queue init before LRC creation
A queue must be in the submission backend's tracking state before the
LRC is created to avoid a race condition where the LRC's GGTT addresses
are not properly fixed up during VF post-migration recovery.

Move the queue initialization—which adds the queue to the submission
backend's tracking state—before LRC creation.

Also wait on pending GGTT fixups before allocating LRCs to avoid racing
with fixups.

v2:
 - Wait on VF GGTT fixes before creating LRC (testing)
v5:
 - Adjust comment in code (Tomasz)
 - Reduce race window
v7:
 - Only wakeup waiters in recovery path (CI)
 - Wakeup waiters on abort
 - Use GT warn on (Michal)
 - Fix kernel doc for LRC ring size function (Tomasz)
v8:
 - Guard against migration not supported or no memirq (CI)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-28-matthew.brost@intel.com
2025-10-09 03:22:53 -07:00
Matthew Brost
c25c1010df drm/xe/vf: Replay GuC submission state on pause / unpause
Fixup GuC submission pause / unpause functions to properly replay any
possible state lost during VF post migration recovery.

v3:
 - Add helpers for revert / replay (Tomasz)
 - Add comment around WQ NOPs (Tomasz)
v7:
 - Only fixup / replay parallel queues once (Testing)
 - Skip unpause step on queues created after resfix done (Testing)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-27-matthew.brost@intel.com
2025-10-09 03:22:50 -07:00