Commit Graph

1382240 Commits

Author SHA1 Message Date
Michal Wajdeczko
dcc38bc5e1 drm/xe/kunit: Drop xe_wa_test_exit
Remove xe_wa_test_exit() as it could crach the KUnit kernel in
case of hitting some asserts in xe_wa_test_init() as test->priv
could not be pointing to expected data.

 |    # xe_wa_gt: ASSERTION FAILED at drivers/gpu/drm/xe/tests/xe_wa_test.c:34
 |    Expected ret == 0, but
 |        ret == -19 (0xffffffffffffffed)
 |Bus error - the host /dev/shm or /tmp mount likely just ran out of space
 |Kernel panic - not syncing: Kernel mode signal 7

Note that there is no need to call drm_kunit_helper_free_device()
since our fake device allocated by drm_kunit_helper_alloc_device()
will be cleaned up automatically.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-7-michal.wajdeczko@intel.com
2025-09-05 12:57:29 +02:00
Michal Wajdeczko
a9c8517058 drm/xe/kunit: Promote fake platform parameter list
The list of all known representative platforms defined in xe_wa
could be used in more places by other test suites.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-6-michal.wajdeczko@intel.com
2025-09-05 12:57:27 +02:00
Michal Wajdeczko
ddbe5aecea drm/xe/kunit: Drop custom struct platform_test_case
Custom struct platform_test_case definition in xe_wa is now almost
identical to generic struct xe_pci_fake_data defintiion except the
.name member, which could be generated by xe_pci_fake_data_desc().

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-5-michal.wajdeczko@intel.com
2025-09-05 12:57:26 +02:00
Michal Wajdeczko
b1ee655843 drm/xe/kunit: Introduce xe_pci_fake_data_desc()
We already use struct xe_pci_fake_data to provide custom config of
the fake PCI device and soon we will be using this struct also as
direct parameter for the parameterized Xe KUnit tests.

Add function to generate description based on that config data.

For platform or subplatform name lookup pciidlist which already
have definitions of all supported platforms.

Examples:

  TIGERLAKE
  TIGERLAKE A0
  TIGERLAKE SR-IOV PF
  ...
  PANTHERLAKE 30.00(Xe3_LPG) 30.00(Xe3_LPM)
  PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
  PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0 SR-IOV VF

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-4-michal.wajdeczko@intel.com
2025-09-05 12:57:25 +02:00
Michal Wajdeczko
42367babd8 drm/xe/kunit: Update struct xe_pci_fake_data step declarations
The struct xe_pci_fake_data has fields that specify graphics and
media stepping of the fake PCI device used during KUnit testing.

Change definitions of those separate step fields and use existing
struct xe_step_info definition that already have required fields.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-3-michal.wajdeczko@intel.com
2025-09-05 12:57:23 +02:00
Michal Wajdeczko
981daf1046 drm/xe: Allow to stub lookup for graphics and media IP
In upcoming patch we will want to replace lookup code during the
test to relax the strict match that we use in production.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250829171922.572-2-michal.wajdeczko@intel.com
2025-09-05 12:57:22 +02:00
Matthew Auld
edb1745fc6 drm/xe: improve dma-resv handling for backup object
Since the dma-resv is shared we don't need to reserve and add a fence
slot fence twice, plus no need to loop through the dependencies.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250829164715.720735-2-matthew.auld@intel.com
2025-09-05 11:53:00 +01:00
Matthew Auld
7477c4bd20 drm/xe/pt: unify xe_pt_svm_pre_commit with userptr
We now use the same notifier lock for SVM and userptr, with that we can
combine xe_pt_userptr_pre_commit and xe_pt_svm_pre_commit.

v2: (Matt B)
  - Re-use xe_svm_notifier_lock/unlock for userptr.
  - Combine svm/userptr handling further down into op_check_svm_userptr.
v3:
  - Only hide the ops if we lack DRM_GPUSVM, since we also need them for
    userptr.

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-18-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
9e97874148 drm/xe/userptr: replace xe_hmm with gpusvm
Goal here is cut over to gpusvm and remove xe_hmm, relying instead on
common code. The core facilities we need are get_pages(), unmap_pages()
and free_pages() for a given useptr range, plus a vm level notifier
lock, which is now provided by gpusvm.

v2:
  - Reuse the same SVM vm struct we use for full SVM, that way we can
    use the same lock (Matt B & Himal)
v3:
  - Re-use svm_init/fini for userptr.
v4:
  - Allow building xe without userptr if we are missing DRM_GPUSVM
    config. (Matt B)
  - Always make .read_only match xe_vma_read_only() for the ctx. (Dafna)
v5:
  - Fix missing conversion with CONFIG_DRM_XE_USERPTR_INVAL_INJECT
v6:
  - Convert the new user in xe_vm_madise.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Dafna Hirschfeld <dafna.hirschfeld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-17-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
dd25b995a2 drm/xe/vm: split userptr bits into separate file
This will simplify compiling out the bits that depend on DRM_GPUSVM in a
later patch. Without this we end up littering the code with ifdef
checks, plus it becomes hard to be sure that something won't blow at
runtime due to something not being initialised, even though it passed
the build. Should be no functional change here.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-16-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
83f706ecbd drm/gpusvm: export drm_gpusvm_pages API
Export get/unmap/free pages API. We also need to tweak the SVM init to
allow skipping much of the unneeded parts.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-15-matthew.auld@intel.com
2025-09-05 11:45:47 +01:00
Matthew Auld
6364afd532 drm/gpusvm: refactor core API to use pages struct
Refactor the core API of get/unmap/free pages to all operate on
drm_gpusvm_pages. In the next patch we want to export a simplified core
API without needing fully blown svm range etc.

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-14-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Matthew Auld
f70da6f99d drm/gpusvm: pull out drm_gpusvm_pages substructure
Pull the pages stuff from the svm range into its own substructure, with
the idea of having the main pages related routines, like get_pages(),
unmap_pages() and free_pages() all operating on some lower level
structures, which can then be re-used for stuff like userptr.

v2:
  - Move seq into pages struct (Matt B)
v3:
  - Small kernel-doc fixes

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-13-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Matthew Auld
ad70e289ed drm/gpusvm: use more selective dma dir in get_pages()
If we are only reading the memory then from the device pov the direction
can be DMA_TO_DEVICE. This aligns with the xe-userptr code. Using the
most restrictive data direction to represent the access is normally a
good idea.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-12-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Matthew Auld
c50729c68a drm/gpusvm: fix hmm_pfn_to_map_order() usage
Handle the case where the hmm range partially covers a huge page (like
2M), otherwise we can potentially end up doing something nasty like
mapping memory which is outside the range, and maybe not even mapped by
the mm. Fix is based on the xe userptr code, which in a future patch
will directly use gpusvm, so needs alignment here.

v2:
  - Add kernel-doc (Matt B)
  - s/fls/ilog2/ (Thomas)

Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250828142430.615826-11-matthew.auld@intel.com
2025-09-05 11:45:46 +01:00
Harish Chegondi
5952d80514 drm/xe/xe2hpg: Add Wa_18041344222 for Xe2_HPG
Add Wa_18041344222 for Xe2_HPG that requires disabling
the perf mode for subslice count for eustall sampling
when the enabled slices are discontiguous.

Bspec: 79483, 56024
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/b6a631a13a9fb7360e89d679e0797fae42d5a09e.1756855529.git.harish.chegondi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-05 03:02:34 -07:00
Harish Chegondi
6ee8adf124 drm/xe/mcr: Make xe_gt_mcr_get_dss_steering() input gt a const
Make gt, input parameter to xe_gt_mcr_get_dss_steering(), a
constant. This would allow xe_gt_mcr_get_dss_steering() to
be called from functions that have gt as const to struct xe_gt.

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Link: https://lore.kernel.org/r/9dc621a90880f62ac8e2951afea7952277f7eb0e.1756855529.git.harish.chegondi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-05 03:02:34 -07:00
Michal Wajdeczko
3088f485de drm/xe/configfs: Don't expose survivability_mode if not applicable
The survivability_mode attribute is applicable only for DGFX and
platforms newer than BATTLEMAGE. Use .is_visible() hook to hide
this attribute when above conditions are not met. Remove code that
was trying to fix such configuration during the runtime.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250902131744.5076-4-michal.wajdeczko@intel.com
2025-09-04 22:33:51 +02:00
Michal Wajdeczko
b076d32177 drm/xe/configfs: Prepare to filter-out configfs attributes
Implement empty ops.is_visible hook to allow filtering-out any not
supported attributes, as not all of them are applicable on all xe
platforms.

Since during creation of each new configfs directory we are looking
for xe device descriptor to validate that xe driver supports given
PCI device, store reference to that descriptor to allow later use
while doing attribute filtering.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250902131744.5076-3-michal.wajdeczko@intel.com
2025-09-04 22:32:46 +02:00
Michal Wajdeczko
079a5c83db drm/xe/configfs: Don't touch survivability_mode on fini
This is a user controlled configfs attribute, we should not
modify that outside the configfs attr.store() implementation.

Fixes: bc417e54e2 ("drm/xe: Enable configfs support for survivability mode")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250904103521.7130-1-michal.wajdeczko@intel.com
2025-09-04 22:31:26 +02:00
Michal Wajdeczko
2506af5f81 drm/xe/guc: Set upper limit of H2G retries over CTB
The GuC communication protocol allows GuC to send NO_RESPONSE_RETRY
reply message to indicate that due to some interim condition it can
not handle incoming H2G request and the host shall resend it.

But in some cases, due to errors, this unsatisfied condition might
be final and this could lead to endless retries as it was recently
seen on the CI:

 [drm] GT0: PF: VF1 FLR didn't finish in 5000 ms (-ETIMEDOUT)
 [drm] GT0: PF: VF1 resource sanitizing failed (-ETIMEDOUT)
 [drm] GT0: PF: VF1 FLR failed!
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0

To avoid such dangerous loops allow only limited number of retries
(for now 50) and add some delays (n * 5ms) to slow down the rate of
resending this repeated request.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://lore.kernel.org/r/20250903223330.6408-1-michal.wajdeczko@intel.com
2025-09-04 22:24:51 +02:00
Michal Wajdeczko
a85ead6d7f drm/xe/debugfs: Move sa_info from gt to tile directory
Our drm-based suballocator is implemented per-tile so it is better
to show its debug information also per-tile debugfs directory, not
under per-gt directory as it is done today.

To allow adding more per-tile attributes, prepare necessary helper
functions, like we already did for per-gt or per-uc attributes.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250829201106.1263-1-michal.wajdeczko@intel.com
2025-09-04 12:45:09 +02:00
Himal Prasad Ghimiray
fece859855 drm/xe/vm: Fix error handling in xe_vm_query_vmas_attrs_ioctl()
copy_to_user() returns the number of bytes not copied on failure, not a
negative error code. Update the logic to return -EFAULT instead of the
number of bytes to correctly signal the error.

Fixes: 418807860e ("drm/xe/uapi: Add UAPI for querying VMA count and memory attributes")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20250828104933.3839825-3-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-09-04 11:58:21 +05:30
Himal Prasad Ghimiray
294912f93d drm/xe: Fix indentation in xe_zap_ptes_in_madvise_range
Fix misleading indentation around WRITE_ONCE in pte zap loop.
No functional change intended.

Fixes: ada7486c56 ("drm/xe: Implement madvise ioctl for xe")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20250828104933.3839825-2-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-09-04 11:58:21 +05:30
Matthew Brost
4208fac3dc drm/xe: Add more SVM GT stats
Add more SVM GT stats which give visibility to where time is spent in
the SVM page fault handler. Stats include number of faults at a given
size, total SVM page fault time, migration time in us, copy time in us,
copy kb, get pages time in us, and bind time in us. Will help in tuning
SVM for performance.

v2:
 - Include local changes
v3:
 - Add tlb invalidation + valid page fault + per size copy size stats
v4:
 - Ensure gt not NULL when incrementing SVM copy stats
 - Normalize stats names
 - Use magic macros to generate increment functions for ranges
v7:
 - Use DEF_STAT_STR (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://lore.kernel.org/r/20250829172232.1308004-3-matthew.brost@intel.com
2025-09-02 22:23:08 -07:00
Matthew Brost
56e6d56885 drm/xe: Add clearing stats to GT debugfs
It helpful to clear GT stats, run a test cases which is being profiled,
and look at the results of the stats from the individual test case. Make
stats entry writable and upon write clear the stats.

v5:
 - Drop clear_stats debugfs entry (Lucas)
v6:
 - Use xe_gt_stats_clear rather than helper (Michal)
 - Rework loop in xe_gt_stats_clear (Michal)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250829172232.1308004-2-matthew.brost@intel.com
2025-09-02 22:23:07 -07:00
Tangudu Tilak Tirumalesh
8d6f16f1f0 drm/xe: Extend Wa_22021007897 to Xe3 platforms
WA 22021007897 should also be applied to Graphics Versions 30.00, 30.01
and 30.03. To make it simple, simply use the range [3000, 3003] that
should be ok as there isn't a 3002 and if it's added, the WA list would
need to be revisited anyway.

Cc: Matt Atwood <matthew.s.atwood@intel.com>
Cc: Gustavo Sousa <gustavo.sousa@intel.com>
Signed-off-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://lore.kernel.org/r/20250827-wa-22021007897-v1-1-96922eb52af4@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-02 20:10:44 -07:00
Zhanjun Dong
ad83b1da5b drm/xe/guc: Increase GuC crash dump buffer size
There are platforms already have a maximum dump size of 12KB, to avoid
data truncating, increase GuC crash dump buffer size to 16KB.

Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250829160427.1245732-1-zhanjun.dong@intel.com
2025-09-02 13:57:17 -07:00
Satyanarayana K V P
be5590c384 drm/xe/vf: Enable CCS save/restore only on supported GUC versions
CCS save/restore is supported starting with GuC 70.48.0 (compatibility
version 1.23.0). Gate the feature on the GuC firmware version and keep it
disabled on older or unsupported versions.

Fixes: f3009272ff ("drm/xe/vf: Create contexts for CCS read write")
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Andi Shyti <andi.shyti@kernel.org>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250902103256.21658-2-satyanarayana.k.v.p@intel.com
2025-09-02 18:59:17 +02:00
Satyanarayana K V P
ee4b32220a drm/xe/guc: Add devm release action to safely tear down CT
When a buffer object (BO) is allocated with the XE_BO_FLAG_GGTT_INVALIDATE
flag, the driver initiates TLB invalidation requests via the CTB mechanism
while releasing the BO. However a premature release of the CTB BO can lead
to system crashes, as observed in:

Oops: Oops: 0000 [#1] SMP NOPTI
RIP: 0010:h2g_write+0x2f3/0x7c0 [xe]
Call Trace:
 guc_ct_send_locked+0x8b/0x670 [xe]
 xe_guc_ct_send_locked+0x19/0x60 [xe]
 send_tlb_invalidation+0xb4/0x460 [xe]
 xe_gt_tlb_invalidation_ggtt+0x15e/0x2e0 [xe]
 ggtt_invalidate_gt_tlb.part.0+0x16/0x90 [xe]
 ggtt_node_remove+0x110/0x140 [xe]
 xe_ggtt_node_remove+0x40/0xa0 [xe]
 xe_ggtt_remove_bo+0x87/0x250 [xe]

Introduce a devm-managed release action during xe_guc_ct_init() and
xe_guc_ct_init_post_hwconfig() to ensure proper CTB disablement before
resource deallocation, preventing the use-after-free scenario.

Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Summers Stuart <stuart.summers@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250901072541.31461-1-satyanarayana.k.v.p@intel.com
2025-09-02 08:21:58 +02:00
Thomas Hellström
1047bd8279 drm/xe: Fix incorrect migration of backed-up object to VRAM
If an object is backed up to shmem it is incorrectly identified
as not having valid data by the move code. This means moving
to VRAM skips the -EMULTIHOP step and the bo is cleared. This
causes all sorts of weird behaviour on DGFX if an already evicted
object is targeted by the shrinker.

Fix this by using ttm_tt_is_swapped() to identify backed-up
objects.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5996
Fixes: 00c8efc318 ("drm/xe: Add a shrinker for xe bos")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: <stable@vger.kernel.org> # v6.15+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250828134837.5709-1-thomas.hellstrom@linux.intel.com
2025-08-29 15:30:57 +02:00
Himal Prasad Ghimiray
eeb8117f5f drm/xe/uapi: Fix kernel-doc formatting for madvise and vma_query
Correct kernel-doc formatting issues in the UAPI definitions for
madvise and VMA query interfaces to resolve docutils warnings during
documentation build.

Fixes: 418807860e ("drm/xe/uapi: Add UAPI for querying VMA count and memory attributes")
Fixes: 231bb0ee7a ("drm/xe/uapi: Add madvise interface")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250828071516.3838110-1-himal.prasad.ghimiray@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-08-28 07:08:08 -07:00
Riana Tauro
ed906b7076 drm/xe/nvm: Use root tile mmio
To allow initialization of nvm during early probe for future usecases,
use root tile instead of root gt to access mmios, as gt is not
yet initialized at early probe.

v2: fix commit message (Lucas)

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250825103537.2551837-1-riana.tauro@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-08-28 06:33:05 -07:00
Marcin Bernatowicz
12267f831f drm/xe/tests: Make cross-device dma-buf BOs CPU-visible on small BAR
Small-BAR systems (e.g., SR-IOV VFs in VMs) expose only a subset of
VRAM via PCI/BAR. Exporting a BO outside that window fails, and the
selftests also do CPU fill/verify.

Set XE_BO_FLAG_NEEDS_CPU_ACCESS for cross-device variants to force
CPU-mappable placement and keep tests reliable. Large-BAR/P2P setups
are unaffected.

Suggested-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250814145950.430231-1-marcin.bernatowicz@linux.intel.com
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
2025-08-28 13:21:20 +02:00
Matthew Auld
81a45cb7ea drm/xe/migrate: make MI_TLB_INVALIDATE conditional
When clearing VRAM we should be able to skip invalidating the TLBs if we
are only using the identity map to access VRAM (which is the common
case), since no modifications are made to PTEs on the fly. Also since we
use huge 1G entries within the identity map, there should be a pretty
decent chance that the next packet(s) (if also clears) can avoid a tree
walk if we don't shoot down the TLBs, like if we have to process a long
stream of clears.

For normal moves/copies, we usually always end up with the src or dst
being system memory, meaning we can't only rely on the identity map and
will also need to emit PTEs and so will always require a TLB flush.

v2:
  - Update commit to explain the situation for normal copies (Matt B)
  - Rebase on latest changes

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250808110452.467513-2-matthew.auld@intel.com
2025-08-28 09:58:19 +01:00
Matthew Brost
db16f9d90c drm/xe: Split TLB invalidation code in frontend and backend
The frontend exposes an API to the driver to send invalidations, handles
sequence number assignment, synchronization (fences), and provides a
timeout mechanism. The backend issues the actual invalidation to the
hardware (or firmware).

The new layering easily allows issuing TLB invalidations to different
hardware or firmware interfaces.

Normalize some naming while here too.

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-10-stuart.summers@intel.com
2025-08-27 11:49:31 -07:00
Matthew Brost
8443e8c448 drm/xe: Add helpers to send TLB invalidations
Break out the GuC specific code into helpers as part of the process to
decouple frontback TLB invalidation code from the backend.

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-9-stuart.summers@intel.com
2025-08-27 11:49:27 -07:00
Matthew Brost
9aff63cf37 drm/xe: Prep TLB invalidation fence before sending
It is a bit backwards to add a TLB invalidation fence to the pending
list after issuing the invalidation. Perform this step before issuing
the TLB invalidation in a helper function.

v2: Make sure the seqno_lock mutex covers the send as well (Matt)

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-8-stuart.summers@intel.com
2025-08-27 11:49:24 -07:00
Matthew Brost
15366239e2 drm/xe: Decouple TLB invalidations from GT
Decouple TLB invalidations from the GT by updating the TLB invalidation
layer to accept a `struct xe_tlb_inval` instead of a `struct xe_gt`.
Also, rename *gt_tlb* to *tlb*. The internals of the TLB invalidation
code still operate on a GT, but this is now hidden from the rest of the
driver.

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-7-stuart.summers@intel.com
2025-08-27 11:49:18 -07:00
Matthew Brost
6d1e452e09 drm/xe: Add xe_gt_tlb_invalidation_done_handler
Decouple GT TLB seqno handling from G2H handler.

v2:
 - Add kernel doc

Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-6-stuart.summers@intel.com
2025-08-27 11:49:13 -07:00
Matthew Brost
594bb930fc drm/xe: Add xe_tlb_inval structure
Extract TLB invalidation state into a structure to decouple TLB
invalidations from the GT, allowing the structure to be embedded
anywhere in the driver.

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-5-stuart.summers@intel.com
2025-08-27 11:49:08 -07:00
Matthew Brost
c697ddcf27 drm/xe: s/tlb_invalidation/tlb_inval
tlb_invalidation is a bit verbose leading to ugly wraps in the code,
shorten to tlb_inval.

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-4-stuart.summers@intel.com
2025-08-27 11:49:00 -07:00
Stuart Summers
76186a253a drm/xe: Cancel pending TLB inval workers on teardown
Add a new _fini() routine on the GT TLB invalidation
side to handle this worker cleanup on driver teardown.

v2: Move the TLB teardown to the gt fini() routine called during
    gt_init rather than in gt_alloc. This way the GT structure stays
    alive for while we reset the TLB state.

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-3-stuart.summers@intel.com
2025-08-27 11:48:56 -07:00
Stuart Summers
ce5059bf85 drm/xe: Move explicit CT lock in TLB invalidation sequence
Currently the CT lock is used to cover TLB invalidation
sequence number updates. In an effort to separate the GuC
back end tracking of communication with the firmware from
the front end TLB sequence number tracking, add a new lock
here to specifically track those sequence number updates
coming in from the user.

Apart from the CT lock, we also have a pending lock to
cover both pending fences and sequence numbers received
from the back end. Those cover interrupt cases and so
it makes not to overload those with sequence numbers
coming in from new transactions. In that way, we'll employ
a mutex here.

v2: Actually add the correct lock rather than just dropping
    it... (Matt)

Signed-off-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250826182911.392550-2-stuart.summers@intel.com
2025-08-27 11:48:37 -07:00
Lucas De Marchi
2674f1ef29 drm/xe/configfs: Block runtime attribute changes
Although it's possible to change the attributes in runtime, they have no
effect after the driver is already bound to the device. Check for that
and return -EBUSY in that case.

This should help users understand what's going on when the behavior is
not changing even if the value from the configfs is "right", but it got
to that state too late.

Reviewed-by: Riana Tauro <riana.tauro@intel.com>
Link: https://lore.kernel.org/r/20250826153210.3068808-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-08-27 11:33:03 -07:00
Xin Wang
95d0883ac8 drm/xe: Ensure GT is in C0 during resumes
This patch ensures the gt will be awake for the entire duration
of the resume sequences until GuCRC takes over and GT-C6 gets
re-enabled.

Before suspending GT-C6 is kept enabled, but upon resume, GuCRC
is not yet alive to properly control the exits and some cases of
instability and corruption related to GT-C6 can be observed.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037

Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037
Link: https://lore.kernel.org/r/20250827000633.1369890-3-x.wang@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-27 09:31:30 -04:00
Xin Wang
1313351e71 drm/xe: make xe_gt_idle_disable_c6() handle the forcewake internally
Move forcewake_get() into xe_gt_idle_enable_c6() to streamline the
code and make it easier to use.

Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250827000633.1369890-2-x.wang@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-27 09:31:30 -04:00
Chaitanya Kumar Borah
d738e1be2b drm/xe/wcl: Extend L3bank mask workaround
The commit 9ab440a9d0 ("drm/xe/ptl: L3bank mask is not
available on the media GT") added a workaround to ignore
the fuse register that L3 bank availability as it did not
contain valid values. Same is true for WCL therefore extend
the workaround to cover it.

Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Reviewed-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com>
Link: https://lore.kernel.org/r/20250822002512.1129144-1-chaitanya.kumar.borah@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>
2025-08-26 16:30:48 -03:00
Riana Tauro
d1f51a4f95 drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
Add a debugfs fault handler to trigger csc error handler that
wedges the device and enables runtime survivability mode.

v2: add debugfs only for bmg (Umesh)
v3: do not use csc_fault attribute if debugfs is not enabled
v4: rebase

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://lore.kernel.org/r/20250826063419.3022216-11-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-26 10:11:34 -04:00
Riana Tauro
a7df563b45 drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
Add support to handle CSC firmware reported errors. When CSC firmware
errors are encoutered, a error interrupt is received by the GFX device as
a MSI interrupt.

Device Source control registers indicates the source of the error as CSC
The HEC error status register indicates that the error is firmware reported
Depending on the type of error, the error cause is written to the HEC
Firmware error register.

On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is firmware flash.

System admin/userspace is notified of the necessity of firmware flash
with a combination of vendor-specific drm device edged uevent, dmesg logs
and runtime survivability sysfs. It is the responsiblity of the consumer
to verify all the actions and then trigger a firmware flash using tools
like fwupd.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[754.709341] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0

Logs

xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: Tile0 reported NONFATAL error 0x20000
xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: NONFATAL: HEC Uncorrected FW FD Corruption error reported, bit[2] is set
xe 0000:03:00.0: Runtime Survivability mode enabled
xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.
               IOCTLs and executions are blocked. Only a rebind may clear the failure
               Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new
xe 0000:03:00.0: [drm] device wedged, needs recovery
xe 0000:03:00.0: Firmware flash required, Please refer to the userspace documentation for more details!

Runtime survivability Sysfs:

/sys/bus/pci/devices/<device>/survivability_mode

v2: use vendor recovery method with
    runtime survivability (Christian, Rodrigo, Raag)
v3: move declare wedged to runtime survivability mode (Rodrigo)
v4: update commit message

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Link: https://lore.kernel.org/r/20250826063419.3022216-10-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-26 10:11:34 -04:00