Commit Graph

1382265 Commits

Author SHA1 Message Date
Thomas Hellström
0131514f97 drm/xe: Pass down drm_exec context to validation
We want all validation (potential backing store allocation) to be part
of a drm_exec transaction. Therefore add a drm_exec pointer argument
to xe_bo_validate() and ___xe_bo_create_locked(). Upcoming patches
will deal with making all (or nearly all) calls to these functions
part of a drm_exec transaction. In the meantime, define special values
of the drm_exec pointer:

XE_VALIDATION_UNIMPLEMENTED: Implementation of the drm_exec transaction
has not been done yet.
XE_VALIDATION_UNSUPPORTED: Some Middle-layers (dma-buf) doesn't allow
the drm_exec context to be passed down to map_attachment where
validation takes place.
XE_VALIDATION_OPT_OUT: May be used only for kunit tests where exhaustive
eviction isn't crucial and the ROI of converting those is very
small.

For XE_VALIDATION_UNIMPLEMENTED and XE_VALIDATION_OPT_OUT there is also
a lockdep check that a drm_exec transaction can indeed start at the
location where the macro is expanded. This is to encourage
developers to take this into consideration early in the code
development process.

v2:
- Fix xe_vm_set_validation_exec() imbalance. Add an assert that
  hopefully catches future instances of this (Matt Brost)
v3:
- Extend to psmi_alloc_object

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com> #v3
Link: https://lore.kernel.org/r/20250908101246.65025-2-thomas.hellstrom@linux.intel.com
2025-09-10 09:15:52 +02:00
Julia Filipchuk
0d40ea7843 drm/xe/guc: Recommend GUC v70.49.4 for PTL, BMG
UAPI compatibility version 1.24.4

Resolves various BMG, SRIOV issues and adds new PTL features.

Signed-off-by: Julia Filipchuk <julia.filipchuk@intel.com>
Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://lore.kernel.org/r/20250829235305.672287-7-julia.filipchuk@intel.com
2025-09-09 13:00:58 -07:00
Michal Wajdeczko
43fac1b2f0 drm/xe/guc: Don't invoke disable_ct action during replacement
During second CT initialization step, known as post_hwconfig, we
want to replace previously registered CT disable devm action to
make sure it will be invoked prior to releasing underlying BO.

But to replace this action we don't need to execute it right away
since we know that CT was disabled prior to that late init step
and we already assert that. Use devm_remove_action() instead to
avoid extra message about 'disabling CT' that could be seen now:

 [drm:guc_ct_change_state [xe]] GT0: GuC CT communication channel disabled
 ...
 DEVRES REM ff11000149320940 guc_action_disable_ct (16 bytes)
 [drm:guc_ct_change_state [xe]] GT0: GuC CT communication channel disabled
 DEVRES ADD ff110001664fc040 guc_action_disable_ct (16 bytes)

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908102053.539-3-michal.wajdeczko@intel.com
2025-09-09 20:38:26 +02:00
Michal Wajdeczko
955f3bc4af drm/xe/guc: Always add CT disable action during second init step
On DGFX, during init_post_hwconfig() step, we are reinitializing
CTB BO in VRAM and we have to replace cleanup action to disable CT
communication prior to release of underlying BO.

But that introduces some discrepancy between DGFX and iGFX, as for
iGFX we keep previously added disable CT action that would be called
during unwind much later.

To keep the same flow on both types of platforms, always replace old
cleanup action and register new one.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908102053.539-2-michal.wajdeczko@intel.com
2025-09-09 20:38:25 +02:00
Matt Roper
30071d58df drm/xe: Never report L3 bank mask for media GT going forward
We currently report an L3 bank mask as part of the GT topology on both
GTs (primary and media) because a copy of the L3 bank fuse register
exists on both GTs (e.g., $gsi_offset + 0x9130 on Xe3).  After recent
discussions it's come to light that the only known userspace software
that uses this part of the uapi (the compute UMD and Mesa) only uses the
value reported for the primary GT; the value reported for the media GT
is ignored by both projects, and the media UMDs don't have any use for
L3 information today.  Since we always strive to have our uapi match the
specific needs of userspace and not include additional unused baggage,
let's officially drop L3 bank reporting on the media GT going forward
and only keep it around for the primary GT where it actually gets used.
This change will only apply to future platforms (Xe3 and later); even
though it would probably be safe to remove it from Xe1/Xe2 as well, we
don't want to take any chances with changing existing ABI.

Note that we'd already disabled reading/reporting of the L3 bank for the
media GT on PTL in commit 9ab440a9d0 ("drm/xe/ptl: L3bank mask is not
available on the media GT") because it was discovered that the copy of
the fuse registers on the media GT were just reporting a bogus ~0 value
rather than an accurate mask.  So this is just extending that PTL
behavior forward to WCL and other future platforms.  Note that we're
also free to reinstate this part of the uapi in the future if/when some
new userspace consumer emerges that _does_ have a use for media-specific
L3 bank masks.

Cc: Fei Yang <fei.yang@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20250905215614.796247-3-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-09-09 10:04:29 -07:00
Michal Wajdeczko
f261f5ddde drm/xe/debugfs: Don't expose dgfx residencies attributes on VF
In addition of checking if we are running on the BATTLEMAGE,
we should also check for not being a VF driver, as VFs can't
access necessary registers, and doing so leads to:

 | .. [drm] GT0: VF is trying to read an inaccessible register 0x35b004+0x0
 | RIP: 0010:xe_gt_sriov_vf_read32+0x5e2/0x8a0 [xe]
 | Call Trace:
 |  xe_mmio_read32+0x110/0x280 [xe]
 |  read_residency_counter+0x42/0xd0 [xe]
 |  dgfx_pkg_residencies_show+0x115/0x190 [xe]
 | .. [drm] Package G2 counter failed to read, ret -19

or

 | .. [drm] GT0: VF is trying to read an inaccessible register 0x35b004+0x0
 | RIP: 0010:xe_gt_sriov_vf_read32+0x5e2/0x8a0 [xe]
 | Call Trace:
 |  xe_mmio_read32+0x110/0x280 [xe]
 |  read_residency_counter+0x42/0xd0 [xe]
 |  dgfx_pcie_link_residencies_show+0xe7/0x160 [xe]
 | .. [drm] PCIE LINK L0 RESIDENCY counter failed to read, ret -19

Similarly, there is no point to expose inject_csc_hw_error on VFs,
as HW errors support is already disabled for VFs.

Bspec: 53221
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Soham Purkait <soham.purkait@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250905173625.8398-1-michal.wajdeczko@intel.com
2025-09-09 17:29:10 +02:00
Christophe JAILLET
7b77941724 drm/xe/hwmon: Use devm_mutex_init()
Use devm_mutex_init() instead of hand-writing it.

This saves some LoC, improves readability and saves some space in the
generated .o file.

Before:
======
   text	   data	    bss	    dec	    hex	filename
  36884	  10296	     64	  47244	   b88c	drivers/gpu/drm/xe/xe_hwmon.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  36651	  10224	     64	  46939	   b75b	drivers/gpu/drm/xe/xe_hwmon.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/989e96369e9e1f8a44b816962917ec76877c912d.1757252520.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-09 06:31:13 -07:00
Michal Wajdeczko
e57ae80fe0 drm/xe/debugfs: Make residencies definitions const
No need to keep them non-const. Also fix declaration of .name
member, as it points to the const string. This translates to:

  add/remove: 1/0 grow/shrink: 0/2 up/down: 80/-248 (-168)
  Function                                     old     new   delta
  residencies                                    -      80     +80
  dgfx_pcie_link_residencies_show              365     263    -102
  dgfx_pkg_residencies_show                    454     308    -146
  Total: Before=2821548, After=2821380, chg -0.01%

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Soham Purkait <soham.purkait@intel.com>
Cc: Jonathan Cavitt <jonathan.cavitt@intel.com>
Cc: Karthik Poosa <karthik.poosa@intel.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250905180225.8434-1-michal.wajdeczko@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-09 06:05:45 -07:00
Raag Jadav
fce99326c9 drm/xe/i2c: Enable bus mastering
Enable bus mastering for I2C controller to support device initiated
in-band transactions.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20250908055320.2549722-1-raag.jadav@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-09 06:02:35 -07:00
Michal Wajdeczko
fd548b77d5 drm/xe/vf: Move VF CCS debugfs attribute
The VF CCS handling is per-device so its debugfs file should not
be exposed on per-GT basis. Move it up to the device level.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908123025.747-8-michal.wajdeczko@intel.com
2025-09-09 11:27:52 +02:00
Michal Wajdeczko
55ddca2a3c drm/xe/vf: Move VF CCS data to xe_device
We only need single set of VF CCS contexts, they are not per-tile
as initial implementation might suggest. Move all VF CCS data from
xe_tile.sriov.vf to xe_device.sriov.vf. Also rename some structs to
align with the usage and fix their kernel-doc.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908123025.747-7-michal.wajdeczko@intel.com
2025-09-09 11:27:50 +02:00
Michal Wajdeczko
e699700834 drm/xe/bo: Add xe_bo_has_valid_ccs_bb helper
This will allow as to drop ugly IS_VF_CCS_BB_VALID macro.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908123025.747-6-michal.wajdeczko@intel.com
2025-09-09 11:27:49 +02:00
Michal Wajdeczko
b179dfd0db drm/xe/vf: Use single check when calling VF CCS functions
All xe_sriov_vf_ccs() functions but init() expect to be called
when initialization was successful and CCS handling is ready.

Update IS_VF_CCS_READY macro and use it as single entry guard.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908123025.747-5-michal.wajdeczko@intel.com
2025-09-09 11:27:46 +02:00
Michal Wajdeczko
aa8d9d75ea drm/xe/vf: Drop IS_VF_CCS_INIT_NEEDED macro
We only use this macro once and we can open-code it to explicitly
show relevant conditions and avoid duplications.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908123025.747-4-michal.wajdeczko@intel.com
2025-09-09 11:27:45 +02:00
Michal Wajdeczko
4e5bc50ad2 drm/xe/guc: Use proper flag definitions when registering context
In H2G action context type is specified in flags dword in bits 2:1.
Use generic FIELD_PREP macro instead of misleading BIT logic.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250908123025.747-3-michal.wajdeczko@intel.com
2025-09-09 11:27:44 +02:00
Michal Wajdeczko
dd432009f1 drm/xe/guc: Rename xe_guc_register_exec_queue
This function is dedicated for use by the VFs, we shouldn't use
name that might suggests it's general purpose. While there, update
asserts to better reflect intended usage.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250908123025.747-2-michal.wajdeczko@intel.com
2025-09-09 11:27:42 +02:00
Lucas De Marchi
956f5e5bc8 drm/xe/configfs: Use config_group_put()
configfs has a config_group_put() helper that was adopted by
commit 88df7939d7 ("drm/xe/configfs: Rename struct xe_config_device").
Another pending work to add psmi later landed in commit
afe902848b ("drm/xe/configfs: Allow to enable PSMI") and didn't use
the helper.

Use config_group_put() consistently to hide the inner workings of
configfs. No change in behavior since it does exactly the same thing
as currently being done.

Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://lore.kernel.org/r/20250905162236.578117-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-08 07:39:47 -07:00
John Harrison
cf423b928f drm/xe/guc: Fix badly worded error message
If a GuC id lookup failed, the error message was 'Not engine present',
which is bad in multiple ways - incorrect English and 'engines' are
now called 'exec queues' in this context. So fix it.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://lore.kernel.org/r/20250904195752.3846138-3-John.C.Harrison@Intel.com
2025-09-05 13:31:34 -07:00
John Harrison
0b05857dc1 drm/xe/guc: Clean up of GuC 'CTL' defines
All the field generation for the CTL defines (used for GuC init data)
were hand-rolled rather than using FIELD_PREP/REG_GENMASK/BIT macros.

Also, there were a bunch of macros defined for verbosity settings that
were never used.

So fix that all up.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250904195752.3846138-2-John.C.Harrison@Intel.com
2025-09-05 13:31:32 -07:00
Julia Filipchuk
6fc957185e drm/xe: Extend Wa_13011645652 to PTL-H, WCL
Expand workaround to additional graphics architectures.

Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: intel-xe@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.17+
Signed-off-by: Julia Filipchuk <julia.filipchuk@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250903190122.1028373-2-julia.filipchuk@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-05 11:54:30 -07:00
Thomas Hellström
599334572a drm/xe: Block exec and rebind worker while evicting for suspend / hibernate
When the xe pm_notifier evicts for suspend / hibernate, there might be
racing tasks trying to re-validate again. This can lead to suspend taking
excessive time or get stuck in a live-lock. This behaviour becomes
much worse with the fix that actually makes re-validation bring back
bos to VRAM rather than letting them remain in TT.

Prevent that by having exec and the rebind worker waiting for a completion
that is set to block by the pm_notifier before suspend and is signaled
by the pm_notifier after resume / wakeup.

It's probably still possible to craft malicious applications that block
suspending. More work is pending to fix that.

v3:
- Avoid wait_for_completion() in the kernel worker since it could
  potentially cause work item flushes from freezable processes to
  wait forever. Instead terminate the rebind workers if needed and
  re-launch at resume. (Matt Auld)
v4:
- Fix some bad naming and leftover debug printouts.
- Fix kerneldoc.
- Use drmm_mutex_init() for the xe->rebind_resume_lock (Matt Auld).
- Rework the interface of xe_vm_rebind_resume_worker (Matt Auld).

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4288
Fixes: c6a4d46ec1 ("drm/xe: evict user memory in PM notifier")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250904160715.2613-4-thomas.hellstrom@linux.intel.com
2025-09-05 17:03:35 +02:00
Thomas Hellström
ebd546fdff drm/xe: Allow the pm notifier to continue on failure
Its actions are opportunistic anyway and will be completed
on device suspend.

Marking as a fix to simplify backporting of the fix
that follows in the series.

v2:
- Keep the runtime pm reference over suspend / hibernate and
  document why. (Matt Auld, Rodrigo Vivi):

Fixes: c6a4d46ec1 ("drm/xe: evict user memory in PM notifier")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250904160715.2613-3-thomas.hellstrom@linux.intel.com
2025-09-05 17:03:35 +02:00
Thomas Hellström
cb3d7b3b46 drm/xe: Attempt to bring bos back to VRAM after eviction
VRAM+TT bos that are evicted from VRAM to TT may remain in
TT also after a revalidation following eviction or suspend.

This manifests itself as applications becoming sluggish
after buffer objects get evicted or after a resume from
suspend or hibernation.

If the bo supports placement in both VRAM and TT, and
we are on DGFX, mark the TT placement as fallback. This means
that it is tried only after VRAM + eviction.

This flaw has probably been present since the xe module was
upstreamed but use a Fixes: commit below where backporting is
likely to be simple. For earlier versions we need to open-
code the fallback algorithm in the driver.

v2:
- Remove check for dgfx. (Matthew Auld)
- Update the xe_dma_buf kunit test for the new strategy (CI)
- Allow dma-buf to pin in current placement (CI)
- Make xe_bo_validate() for pinned bos a NOP.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995
Fixes: a78a8da51b ("drm/ttm: replace busy placement with flags v6")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: <stable@vger.kernel.org> # v6.9+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250904160715.2613-2-thomas.hellstrom@linux.intel.com
2025-09-05 17:03:35 +02:00
Sanjay Yadav
c4dfa0bea2 drm/xe/migrate: Remove unneeded emit_pte() when copying CCS only
In xe_migrate_copy(), when copy_only_ccs is true, we only need two
emit_pte() calls one for the BO and one for the raw CCS storage.
However, the current implementation issues three emit_pte() calls,
resulting in an unnecessary PTE programming job.

This fix removes the redundant emit_pte() call to avoid programming
the same PTEs twice and reducing overhead during CCS-only migration.

v2: Preserve correct behavior on DG2, which requires both CCS and
page copies.

Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250904161423.2448727-1-sanjay.kumar.yadav@intel.com
2025-09-05 13:29:20 +01:00
Michal Wajdeczko
2d1e962098 drm/xe: Fix broken kernel-doc for the struct xe_bo
Use correct multi-line kernel-doc style if required.
Some members were described only in the commit message.
Some other members were described using wrong names.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250904144026.7222-1-michal.wajdeczko@intel.com
2025-09-05 13:32:13 +02:00
Michal Wajdeczko
dcc38bc5e1 drm/xe/kunit: Drop xe_wa_test_exit
Remove xe_wa_test_exit() as it could crach the KUnit kernel in
case of hitting some asserts in xe_wa_test_init() as test->priv
could not be pointing to expected data.

 |    # xe_wa_gt: ASSERTION FAILED at drivers/gpu/drm/xe/tests/xe_wa_test.c:34
 |    Expected ret == 0, but
 |        ret == -19 (0xffffffffffffffed)
 |Bus error - the host /dev/shm or /tmp mount likely just ran out of space
 |Kernel panic - not syncing: Kernel mode signal 7

Note that there is no need to call drm_kunit_helper_free_device()
since our fake device allocated by drm_kunit_helper_alloc_device()
will be cleaned up automatically.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-7-michal.wajdeczko@intel.com
2025-09-05 12:57:29 +02:00
Michal Wajdeczko
a9c8517058 drm/xe/kunit: Promote fake platform parameter list
The list of all known representative platforms defined in xe_wa
could be used in more places by other test suites.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-6-michal.wajdeczko@intel.com
2025-09-05 12:57:27 +02:00
Michal Wajdeczko
ddbe5aecea drm/xe/kunit: Drop custom struct platform_test_case
Custom struct platform_test_case definition in xe_wa is now almost
identical to generic struct xe_pci_fake_data defintiion except the
.name member, which could be generated by xe_pci_fake_data_desc().

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-5-michal.wajdeczko@intel.com
2025-09-05 12:57:26 +02:00
Michal Wajdeczko
b1ee655843 drm/xe/kunit: Introduce xe_pci_fake_data_desc()
We already use struct xe_pci_fake_data to provide custom config of
the fake PCI device and soon we will be using this struct also as
direct parameter for the parameterized Xe KUnit tests.

Add function to generate description based on that config data.

For platform or subplatform name lookup pciidlist which already
have definitions of all supported platforms.

Examples:

  TIGERLAKE
  TIGERLAKE A0
  TIGERLAKE SR-IOV PF
  ...
  PANTHERLAKE 30.00(Xe3_LPG) 30.00(Xe3_LPM)
  PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
  PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0 SR-IOV VF

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-4-michal.wajdeczko@intel.com
2025-09-05 12:57:25 +02:00
Michal Wajdeczko
42367babd8 drm/xe/kunit: Update struct xe_pci_fake_data step declarations
The struct xe_pci_fake_data has fields that specify graphics and
media stepping of the fake PCI device used during KUnit testing.

Change definitions of those separate step fields and use existing
struct xe_step_info definition that already have required fields.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-3-michal.wajdeczko@intel.com
2025-09-05 12:57:23 +02:00
Michal Wajdeczko
981daf1046 drm/xe: Allow to stub lookup for graphics and media IP
In upcoming patch we will want to replace lookup code during the
test to relax the strict match that we use in production.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-2-michal.wajdeczko@intel.com
2025-09-05 12:57:22 +02:00
Matthew Auld
edb1745fc6 drm/xe: improve dma-resv handling for backup object
Since the dma-resv is shared we don't need to reserve and add a fence
slot fence twice, plus no need to loop through the dependencies.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250829164715.720735-2-matthew.auld@intel.com
2025-09-05 11:53:00 +01:00
Matthew Auld
7477c4bd20 drm/xe/pt: unify xe_pt_svm_pre_commit with userptr
We now use the same notifier lock for SVM and userptr, with that we can
combine xe_pt_userptr_pre_commit and xe_pt_svm_pre_commit.

v2: (Matt B)
  - Re-use xe_svm_notifier_lock/unlock for userptr.
  - Combine svm/userptr handling further down into op_check_svm_userptr.
v3:
  - Only hide the ops if we lack DRM_GPUSVM, since we also need them for
    userptr.

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-18-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
9e97874148 drm/xe/userptr: replace xe_hmm with gpusvm
Goal here is cut over to gpusvm and remove xe_hmm, relying instead on
common code. The core facilities we need are get_pages(), unmap_pages()
and free_pages() for a given useptr range, plus a vm level notifier
lock, which is now provided by gpusvm.

v2:
  - Reuse the same SVM vm struct we use for full SVM, that way we can
    use the same lock (Matt B & Himal)
v3:
  - Re-use svm_init/fini for userptr.
v4:
  - Allow building xe without userptr if we are missing DRM_GPUSVM
    config. (Matt B)
  - Always make .read_only match xe_vma_read_only() for the ctx. (Dafna)
v5:
  - Fix missing conversion with CONFIG_DRM_XE_USERPTR_INVAL_INJECT
v6:
  - Convert the new user in xe_vm_madise.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Dafna Hirschfeld <dafna.hirschfeld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-17-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
dd25b995a2 drm/xe/vm: split userptr bits into separate file
This will simplify compiling out the bits that depend on DRM_GPUSVM in a
later patch. Without this we end up littering the code with ifdef
checks, plus it becomes hard to be sure that something won't blow at
runtime due to something not being initialised, even though it passed
the build. Should be no functional change here.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-16-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
83f706ecbd drm/gpusvm: export drm_gpusvm_pages API
Export get/unmap/free pages API. We also need to tweak the SVM init to
allow skipping much of the unneeded parts.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-15-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
6364afd532 drm/gpusvm: refactor core API to use pages struct
Refactor the core API of get/unmap/free pages to all operate on
drm_gpusvm_pages. In the next patch we want to export a simplified core
API without needing fully blown svm range etc.

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-14-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Matthew Auld
f70da6f99d drm/gpusvm: pull out drm_gpusvm_pages substructure
Pull the pages stuff from the svm range into its own substructure, with
the idea of having the main pages related routines, like get_pages(),
unmap_pages() and free_pages() all operating on some lower level
structures, which can then be re-used for stuff like userptr.

v2:
  - Move seq into pages struct (Matt B)
v3:
  - Small kernel-doc fixes

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-13-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Matthew Auld
ad70e289ed drm/gpusvm: use more selective dma dir in get_pages()
If we are only reading the memory then from the device pov the direction
can be DMA_TO_DEVICE. This aligns with the xe-userptr code. Using the
most restrictive data direction to represent the access is normally a
good idea.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-12-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Matthew Auld
c50729c68a drm/gpusvm: fix hmm_pfn_to_map_order() usage
Handle the case where the hmm range partially covers a huge page (like
2M), otherwise we can potentially end up doing something nasty like
mapping memory which is outside the range, and maybe not even mapped by
the mm. Fix is based on the xe userptr code, which in a future patch
will directly use gpusvm, so needs alignment here.

v2:
  - Add kernel-doc (Matt B)
  - s/fls/ilog2/ (Thomas)

Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-11-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Harish Chegondi
5952d80514 drm/xe/xe2hpg: Add Wa_18041344222 for Xe2_HPG
Add Wa_18041344222 for Xe2_HPG that requires disabling
the perf mode for subslice count for eustall sampling
when the enabled slices are discontiguous.

Bspec: 79483, 56024
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/b6a631a13a9fb7360e89d679e0797fae42d5a09e.1756855529.git.harish.chegondi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-05 03:02:34 -07:00
Harish Chegondi
6ee8adf124 drm/xe/mcr: Make xe_gt_mcr_get_dss_steering() input gt a const
Make gt, input parameter to xe_gt_mcr_get_dss_steering(), a
constant. This would allow xe_gt_mcr_get_dss_steering() to
be called from functions that have gt as const to struct xe_gt.

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Link: https://lore.kernel.org/r/9dc621a90880f62ac8e2951afea7952277f7eb0e.1756855529.git.harish.chegondi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-05 03:02:34 -07:00
Michal Wajdeczko
3088f485de drm/xe/configfs: Don't expose survivability_mode if not applicable
The survivability_mode attribute is applicable only for DGFX and
platforms newer than BATTLEMAGE. Use .is_visible() hook to hide
this attribute when above conditions are not met. Remove code that
was trying to fix such configuration during the runtime.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250902131744.5076-4-michal.wajdeczko@intel.com
2025-09-04 22:33:51 +02:00
Michal Wajdeczko
b076d32177 drm/xe/configfs: Prepare to filter-out configfs attributes
Implement empty ops.is_visible hook to allow filtering-out any not
supported attributes, as not all of them are applicable on all xe
platforms.

Since during creation of each new configfs directory we are looking
for xe device descriptor to validate that xe driver supports given
PCI device, store reference to that descriptor to allow later use
while doing attribute filtering.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250902131744.5076-3-michal.wajdeczko@intel.com
2025-09-04 22:32:46 +02:00
Michal Wajdeczko
079a5c83db drm/xe/configfs: Don't touch survivability_mode on fini
This is a user controlled configfs attribute, we should not
modify that outside the configfs attr.store() implementation.

Fixes: bc417e54e2 ("drm/xe: Enable configfs support for survivability mode")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250904103521.7130-1-michal.wajdeczko@intel.com
2025-09-04 22:31:26 +02:00
Michal Wajdeczko
2506af5f81 drm/xe/guc: Set upper limit of H2G retries over CTB
The GuC communication protocol allows GuC to send NO_RESPONSE_RETRY
reply message to indicate that due to some interim condition it can
not handle incoming H2G request and the host shall resend it.

But in some cases, due to errors, this unsatisfied condition might
be final and this could lead to endless retries as it was recently
seen on the CI:

 [drm] GT0: PF: VF1 FLR didn't finish in 5000 ms (-ETIMEDOUT)
 [drm] GT0: PF: VF1 resource sanitizing failed (-ETIMEDOUT)
 [drm] GT0: PF: VF1 FLR failed!
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0

To avoid such dangerous loops allow only limited number of retries
(for now 50) and add some delays (n * 5ms) to slow down the rate of
resending this repeated request.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://lore.kernel.org/r/20250903223330.6408-1-michal.wajdeczko@intel.com
2025-09-04 22:24:51 +02:00
Michal Wajdeczko
a85ead6d7f drm/xe/debugfs: Move sa_info from gt to tile directory
Our drm-based suballocator is implemented per-tile so it is better
to show its debug information also per-tile debugfs directory, not
under per-gt directory as it is done today.

To allow adding more per-tile attributes, prepare necessary helper
functions, like we already did for per-gt or per-uc attributes.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250829201106.1263-1-michal.wajdeczko@intel.com
2025-09-04 12:45:09 +02:00
Himal Prasad Ghimiray
fece859855 drm/xe/vm: Fix error handling in xe_vm_query_vmas_attrs_ioctl()
copy_to_user() returns the number of bytes not copied on failure, not a
negative error code. Update the logic to return -EFAULT instead of the
number of bytes to correctly signal the error.

Fixes: 418807860e ("drm/xe/uapi: Add UAPI for querying VMA count and memory attributes")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20250828104933.3839825-3-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-09-04 11:58:21 +05:30
Himal Prasad Ghimiray
294912f93d drm/xe: Fix indentation in xe_zap_ptes_in_madvise_range
Fix misleading indentation around WRITE_ONCE in pte zap loop.
No functional change intended.

Fixes: ada7486c56 ("drm/xe: Implement madvise ioctl for xe")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20250828104933.3839825-2-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-09-04 11:58:21 +05:30
Matthew Brost
4208fac3dc drm/xe: Add more SVM GT stats
Add more SVM GT stats which give visibility to where time is spent in
the SVM page fault handler. Stats include number of faults at a given
size, total SVM page fault time, migration time in us, copy time in us,
copy kb, get pages time in us, and bind time in us. Will help in tuning
SVM for performance.

v2:
 - Include local changes
v3:
 - Add tlb invalidation + valid page fault + per size copy size stats
v4:
 - Ensure gt not NULL when incrementing SVM copy stats
 - Normalize stats names
 - Use magic macros to generate increment functions for ranges
v7:
 - Use DEF_STAT_STR (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://lore.kernel.org/r/20250829172232.1308004-3-matthew.brost@intel.com
2025-09-02 22:23:08 -07:00