linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-07 20:58:14 -04:00

Author	SHA1	Message	Date
Matthew Auld	dd25b995a2	drm/xe/vm: split userptr bits into separate file This will simplify compiling out the bits that depend on DRM_GPUSVM in a later patch. Without this we end up littering the code with ifdef checks, plus it becomes hard to be sure that something won't blow at runtime due to something not being initialised, even though it passed the build. Should be no functional change here. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250828142430.615826-16-matthew.auld@intel.com	2025-09-05 11:45:47 +01:00
Matthew Auld	83f706ecbd	drm/gpusvm: export drm_gpusvm_pages API Export get/unmap/free pages API. We also need to tweak the SVM init to allow skipping much of the unneeded parts. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250828142430.615826-15-matthew.auld@intel.com	2025-09-05 11:45:47 +01:00
Matthew Auld	6364afd532	drm/gpusvm: refactor core API to use pages struct Refactor the core API of get/unmap/free pages to all operate on drm_gpusvm_pages. In the next patch we want to export a simplified core API without needing fully blown svm range etc. Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250828142430.615826-14-matthew.auld@intel.com	2025-09-05 11:45:46 +01:00
Matthew Auld	f70da6f99d	drm/gpusvm: pull out drm_gpusvm_pages substructure Pull the pages stuff from the svm range into its own substructure, with the idea of having the main pages related routines, like get_pages(), unmap_pages() and free_pages() all operating on some lower level structures, which can then be re-used for stuff like userptr. v2: - Move seq into pages struct (Matt B) v3: - Small kernel-doc fixes Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250828142430.615826-13-matthew.auld@intel.com	2025-09-05 11:45:46 +01:00
Matthew Auld	ad70e289ed	drm/gpusvm: use more selective dma dir in get_pages() If we are only reading the memory then from the device pov the direction can be DMA_TO_DEVICE. This aligns with the xe-userptr code. Using the most restrictive data direction to represent the access is normally a good idea. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250828142430.615826-12-matthew.auld@intel.com	2025-09-05 11:45:46 +01:00
Matthew Auld	c50729c68a	drm/gpusvm: fix hmm_pfn_to_map_order() usage Handle the case where the hmm range partially covers a huge page (like 2M), otherwise we can potentially end up doing something nasty like mapping memory which is outside the range, and maybe not even mapped by the mm. Fix is based on the xe userptr code, which in a future patch will directly use gpusvm, so needs alignment here. v2: - Add kernel-doc (Matt B) - s/fls/ilog2/ (Thomas) Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250828142430.615826-11-matthew.auld@intel.com	2025-09-05 11:45:46 +01:00
Harish Chegondi	5952d80514	drm/xe/xe2hpg: Add Wa_18041344222 for Xe2_HPG Add Wa_18041344222 for Xe2_HPG that requires disabling the perf mode for subslice count for eustall sampling when the enabled slices are discontiguous. Bspec: 79483, 56024 Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/b6a631a13a9fb7360e89d679e0797fae42d5a09e.1756855529.git.harish.chegondi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-09-05 03:02:34 -07:00
Harish Chegondi	6ee8adf124	drm/xe/mcr: Make xe_gt_mcr_get_dss_steering() input gt a const Make gt, input parameter to xe_gt_mcr_get_dss_steering(), a constant. This would allow xe_gt_mcr_get_dss_steering() to be called from functions that have gt as const to struct xe_gt. Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Link: https://lore.kernel.org/r/9dc621a90880f62ac8e2951afea7952277f7eb0e.1756855529.git.harish.chegondi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-09-05 03:02:34 -07:00
Michal Wajdeczko	3088f485de	drm/xe/configfs: Don't expose survivability_mode if not applicable The survivability_mode attribute is applicable only for DGFX and platforms newer than BATTLEMAGE. Use .is_visible() hook to hide this attribute when above conditions are not met. Remove code that was trying to fix such configuration during the runtime. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250902131744.5076-4-michal.wajdeczko@intel.com	2025-09-04 22:33:51 +02:00
Michal Wajdeczko	b076d32177	drm/xe/configfs: Prepare to filter-out configfs attributes Implement empty ops.is_visible hook to allow filtering-out any not supported attributes, as not all of them are applicable on all xe platforms. Since during creation of each new configfs directory we are looking for xe device descriptor to validate that xe driver supports given PCI device, store reference to that descriptor to allow later use while doing attribute filtering. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250902131744.5076-3-michal.wajdeczko@intel.com	2025-09-04 22:32:46 +02:00
Michal Wajdeczko	079a5c83db	drm/xe/configfs: Don't touch survivability_mode on fini This is a user controlled configfs attribute, we should not modify that outside the configfs attr.store() implementation. Fixes: `bc417e54e2` ("drm/xe: Enable configfs support for survivability mode") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250904103521.7130-1-michal.wajdeczko@intel.com	2025-09-04 22:31:26 +02:00
Michal Wajdeczko	2506af5f81	drm/xe/guc: Set upper limit of H2G retries over CTB The GuC communication protocol allows GuC to send NO_RESPONSE_RETRY reply message to indicate that due to some interim condition it can not handle incoming H2G request and the host shall resend it. But in some cases, due to errors, this unsatisfied condition might be final and this could lead to endless retries as it was recently seen on the CI: [drm] GT0: PF: VF1 FLR didn't finish in 5000 ms (-ETIMEDOUT) [drm] GT0: PF: VF1 resource sanitizing failed (-ETIMEDOUT) [drm] GT0: PF: VF1 FLR failed! [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 To avoid such dangerous loops allow only limited number of retries (for now 50) and add some delays (n * 5ms) to slow down the rate of resending this repeated request. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://lore.kernel.org/r/20250903223330.6408-1-michal.wajdeczko@intel.com	2025-09-04 22:24:51 +02:00
Michal Wajdeczko	a85ead6d7f	drm/xe/debugfs: Move sa_info from gt to tile directory Our drm-based suballocator is implemented per-tile so it is better to show its debug information also per-tile debugfs directory, not under per-gt directory as it is done today. To allow adding more per-tile attributes, prepare necessary helper functions, like we already did for per-gt or per-uc attributes. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250829201106.1263-1-michal.wajdeczko@intel.com	2025-09-04 12:45:09 +02:00
Himal Prasad Ghimiray	fece859855	drm/xe/vm: Fix error handling in xe_vm_query_vmas_attrs_ioctl() copy_to_user() returns the number of bytes not copied on failure, not a negative error code. Update the logic to return -EFAULT instead of the number of bytes to correctly signal the error. Fixes: `418807860e` ("drm/xe/uapi: Add UAPI for querying VMA count and memory attributes") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://lore.kernel.org/r/20250828104933.3839825-3-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-09-04 11:58:21 +05:30
Himal Prasad Ghimiray	294912f93d	drm/xe: Fix indentation in xe_zap_ptes_in_madvise_range Fix misleading indentation around WRITE_ONCE in pte zap loop. No functional change intended. Fixes: `ada7486c56` ("drm/xe: Implement madvise ioctl for xe") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://lore.kernel.org/r/20250828104933.3839825-2-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-09-04 11:58:21 +05:30
Matthew Brost	4208fac3dc	drm/xe: Add more SVM GT stats Add more SVM GT stats which give visibility to where time is spent in the SVM page fault handler. Stats include number of faults at a given size, total SVM page fault time, migration time in us, copy time in us, copy kb, get pages time in us, and bind time in us. Will help in tuning SVM for performance. v2: - Include local changes v3: - Add tlb invalidation + valid page fault + per size copy size stats v4: - Ensure gt not NULL when incrementing SVM copy stats - Normalize stats names - Use magic macros to generate increment functions for ranges v7: - Use DEF_STAT_STR (Michal) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://lore.kernel.org/r/20250829172232.1308004-3-matthew.brost@intel.com	2025-09-02 22:23:08 -07:00
Matthew Brost	56e6d56885	drm/xe: Add clearing stats to GT debugfs It helpful to clear GT stats, run a test cases which is being profiled, and look at the results of the stats from the individual test case. Make stats entry writable and upon write clear the stats. v5: - Drop clear_stats debugfs entry (Lucas) v6: - Use xe_gt_stats_clear rather than helper (Michal) - Rework loop in xe_gt_stats_clear (Michal) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250829172232.1308004-2-matthew.brost@intel.com	2025-09-02 22:23:07 -07:00
Tangudu Tilak Tirumalesh	8d6f16f1f0	drm/xe: Extend Wa_22021007897 to Xe3 platforms WA 22021007897 should also be applied to Graphics Versions 30.00, 30.01 and 30.03. To make it simple, simply use the range [3000, 3003] that should be ok as there isn't a 3002 and if it's added, the WA list would need to be revisited anyway. Cc: Matt Atwood <matthew.s.atwood@intel.com> Cc: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Link: https://lore.kernel.org/r/20250827-wa-22021007897-v1-1-96922eb52af4@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-09-02 20:10:44 -07:00
Zhanjun Dong	ad83b1da5b	drm/xe/guc: Increase GuC crash dump buffer size There are platforms already have a maximum dump size of 12KB, to avoid data truncating, increase GuC crash dump buffer size to 16KB. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250829160427.1245732-1-zhanjun.dong@intel.com	2025-09-02 13:57:17 -07:00
Satyanarayana K V P	be5590c384	drm/xe/vf: Enable CCS save/restore only on supported GUC versions CCS save/restore is supported starting with GuC 70.48.0 (compatibility version 1.23.0). Gate the feature on the GuC firmware version and keep it disabled on older or unsupported versions. Fixes: `f3009272ff` ("drm/xe/vf: Create contexts for CCS read write") Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Andi Shyti <andi.shyti@kernel.org> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250902103256.21658-2-satyanarayana.k.v.p@intel.com	2025-09-02 18:59:17 +02:00
Satyanarayana K V P	ee4b32220a	drm/xe/guc: Add devm release action to safely tear down CT When a buffer object (BO) is allocated with the XE_BO_FLAG_GGTT_INVALIDATE flag, the driver initiates TLB invalidation requests via the CTB mechanism while releasing the BO. However a premature release of the CTB BO can lead to system crashes, as observed in: Oops: Oops: 0000 [#1] SMP NOPTI RIP: 0010:h2g_write+0x2f3/0x7c0 [xe] Call Trace: guc_ct_send_locked+0x8b/0x670 [xe] xe_guc_ct_send_locked+0x19/0x60 [xe] send_tlb_invalidation+0xb4/0x460 [xe] xe_gt_tlb_invalidation_ggtt+0x15e/0x2e0 [xe] ggtt_invalidate_gt_tlb.part.0+0x16/0x90 [xe] ggtt_node_remove+0x110/0x140 [xe] xe_ggtt_node_remove+0x40/0xa0 [xe] xe_ggtt_remove_bo+0x87/0x250 [xe] Introduce a devm-managed release action during xe_guc_ct_init() and xe_guc_ct_init_post_hwconfig() to ensure proper CTB disablement before resource deallocation, preventing the use-after-free scenario. Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Summers Stuart <stuart.summers@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250901072541.31461-1-satyanarayana.k.v.p@intel.com	2025-09-02 08:21:58 +02:00
Thomas Hellström	1047bd8279	drm/xe: Fix incorrect migration of backed-up object to VRAM If an object is backed up to shmem it is incorrectly identified as not having valid data by the move code. This means moving to VRAM skips the -EMULTIHOP step and the bo is cleared. This causes all sorts of weird behaviour on DGFX if an already evicted object is targeted by the shrinker. Fix this by using ttm_tt_is_swapped() to identify backed-up objects. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5996 Fixes: `00c8efc318` ("drm/xe: Add a shrinker for xe bos") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: <stable@vger.kernel.org> # v6.15+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250828134837.5709-1-thomas.hellstrom@linux.intel.com	2025-08-29 15:30:57 +02:00
Himal Prasad Ghimiray	eeb8117f5f	drm/xe/uapi: Fix kernel-doc formatting for madvise and vma_query Correct kernel-doc formatting issues in the UAPI definitions for madvise and VMA query interfaces to resolve docutils warnings during documentation build. Fixes: `418807860e` ("drm/xe/uapi: Add UAPI for querying VMA count and memory attributes") Fixes: `231bb0ee7a` ("drm/xe/uapi: Add madvise interface") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250828071516.3838110-1-himal.prasad.ghimiray@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-08-28 07:08:08 -07:00
Riana Tauro	ed906b7076	drm/xe/nvm: Use root tile mmio To allow initialization of nvm during early probe for future usecases, use root tile instead of root gt to access mmios, as gt is not yet initialized at early probe. v2: fix commit message (Lucas) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250825103537.2551837-1-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-08-28 06:33:05 -07:00
Marcin Bernatowicz	12267f831f	drm/xe/tests: Make cross-device dma-buf BOs CPU-visible on small BAR Small-BAR systems (e.g., SR-IOV VFs in VMs) expose only a subset of VRAM via PCI/BAR. Exporting a BO outside that window fails, and the selftests also do CPU fill/verify. Set XE_BO_FLAG_NEEDS_CPU_ACCESS for cross-device variants to force CPU-mappable placement and keep tests reliable. Large-BAR/P2P setups are unaffected. Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250814145950.430231-1-marcin.bernatowicz@linux.intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>	2025-08-28 13:21:20 +02:00
Matthew Auld	81a45cb7ea	drm/xe/migrate: make MI_TLB_INVALIDATE conditional When clearing VRAM we should be able to skip invalidating the TLBs if we are only using the identity map to access VRAM (which is the common case), since no modifications are made to PTEs on the fly. Also since we use huge 1G entries within the identity map, there should be a pretty decent chance that the next packet(s) (if also clears) can avoid a tree walk if we don't shoot down the TLBs, like if we have to process a long stream of clears. For normal moves/copies, we usually always end up with the src or dst being system memory, meaning we can't only rely on the identity map and will also need to emit PTEs and so will always require a TLB flush. v2: - Update commit to explain the situation for normal copies (Matt B) - Rebase on latest changes Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250808110452.467513-2-matthew.auld@intel.com	2025-08-28 09:58:19 +01:00
Matthew Brost	db16f9d90c	drm/xe: Split TLB invalidation code in frontend and backend The frontend exposes an API to the driver to send invalidations, handles sequence number assignment, synchronization (fences), and provides a timeout mechanism. The backend issues the actual invalidation to the hardware (or firmware). The new layering easily allows issuing TLB invalidations to different hardware or firmware interfaces. Normalize some naming while here too. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-10-stuart.summers@intel.com	2025-08-27 11:49:31 -07:00
Matthew Brost	8443e8c448	drm/xe: Add helpers to send TLB invalidations Break out the GuC specific code into helpers as part of the process to decouple frontback TLB invalidation code from the backend. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-9-stuart.summers@intel.com	2025-08-27 11:49:27 -07:00
Matthew Brost	9aff63cf37	drm/xe: Prep TLB invalidation fence before sending It is a bit backwards to add a TLB invalidation fence to the pending list after issuing the invalidation. Perform this step before issuing the TLB invalidation in a helper function. v2: Make sure the seqno_lock mutex covers the send as well (Matt) Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-8-stuart.summers@intel.com	2025-08-27 11:49:24 -07:00
Matthew Brost	15366239e2	drm/xe: Decouple TLB invalidations from GT Decouple TLB invalidations from the GT by updating the TLB invalidation layer to accept a `struct xe_tlb_inval` instead of a `struct xe_gt`. Also, rename gt_tlb to tlb. The internals of the TLB invalidation code still operate on a GT, but this is now hidden from the rest of the driver. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-7-stuart.summers@intel.com	2025-08-27 11:49:18 -07:00
Matthew Brost	6d1e452e09	drm/xe: Add xe_gt_tlb_invalidation_done_handler Decouple GT TLB seqno handling from G2H handler. v2: - Add kernel doc Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-6-stuart.summers@intel.com	2025-08-27 11:49:13 -07:00
Matthew Brost	594bb930fc	drm/xe: Add xe_tlb_inval structure Extract TLB invalidation state into a structure to decouple TLB invalidations from the GT, allowing the structure to be embedded anywhere in the driver. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-5-stuart.summers@intel.com	2025-08-27 11:49:08 -07:00
Matthew Brost	c697ddcf27	drm/xe: s/tlb_invalidation/tlb_inval tlb_invalidation is a bit verbose leading to ugly wraps in the code, shorten to tlb_inval. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-4-stuart.summers@intel.com	2025-08-27 11:49:00 -07:00
Stuart Summers	76186a253a	drm/xe: Cancel pending TLB inval workers on teardown Add a new _fini() routine on the GT TLB invalidation side to handle this worker cleanup on driver teardown. v2: Move the TLB teardown to the gt fini() routine called during gt_init rather than in gt_alloc. This way the GT structure stays alive for while we reset the TLB state. Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-3-stuart.summers@intel.com	2025-08-27 11:48:56 -07:00
Stuart Summers	ce5059bf85	drm/xe: Move explicit CT lock in TLB invalidation sequence Currently the CT lock is used to cover TLB invalidation sequence number updates. In an effort to separate the GuC back end tracking of communication with the firmware from the front end TLB sequence number tracking, add a new lock here to specifically track those sequence number updates coming in from the user. Apart from the CT lock, we also have a pending lock to cover both pending fences and sequence numbers received from the back end. Those cover interrupt cases and so it makes not to overload those with sequence numbers coming in from new transactions. In that way, we'll employ a mutex here. v2: Actually add the correct lock rather than just dropping it... (Matt) Signed-off-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826182911.392550-2-stuart.summers@intel.com	2025-08-27 11:48:37 -07:00
Lucas De Marchi	2674f1ef29	drm/xe/configfs: Block runtime attribute changes Although it's possible to change the attributes in runtime, they have no effect after the driver is already bound to the device. Check for that and return -EBUSY in that case. This should help users understand what's going on when the behavior is not changing even if the value from the configfs is "right", but it got to that state too late. Reviewed-by: Riana Tauro <riana.tauro@intel.com> Link: https://lore.kernel.org/r/20250826153210.3068808-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-08-27 11:33:03 -07:00
Xin Wang	95d0883ac8	drm/xe: Ensure GT is in C0 during resumes This patch ensures the gt will be awake for the entire duration of the resume sequences until GuCRC takes over and GT-C6 gets re-enabled. Before suspending GT-C6 is kept enabled, but upon resume, GuCRC is not yet alive to properly control the exits and some cases of instability and corruption related to GT-C6 can be observed. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037 Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037 Link: https://lore.kernel.org/r/20250827000633.1369890-3-x.wang@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-27 09:31:30 -04:00
Xin Wang	1313351e71	drm/xe: make xe_gt_idle_disable_c6() handle the forcewake internally Move forcewake_get() into xe_gt_idle_enable_c6() to streamline the code and make it easier to use. Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250827000633.1369890-2-x.wang@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-27 09:31:30 -04:00
Chaitanya Kumar Borah	d738e1be2b	drm/xe/wcl: Extend L3bank mask workaround The commit `9ab440a9d0` ("drm/xe/ptl: L3bank mask is not available on the media GT") added a workaround to ignore the fuse register that L3 bank availability as it did not contain valid values. Same is true for WCL therefore extend the workaround to cover it. Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Reviewed-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com> Link: https://lore.kernel.org/r/20250822002512.1129144-1-chaitanya.kumar.borah@intel.com Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>	2025-08-26 16:30:48 -03:00
Riana Tauro	d1f51a4f95	drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Add a debugfs fault handler to trigger csc error handler that wedges the device and enables runtime survivability mode. v2: add debugfs only for bmg (Umesh) v3: do not use csc_fault attribute if debugfs is not enabled v4: rebase Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-11-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	a7df563b45	drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Add support to handle CSC firmware reported errors. When CSC firmware errors are encoutered, a error interrupt is received by the GFX device as a MSI interrupt. Device Source control registers indicates the source of the error as CSC The HEC error status register indicates that the error is firmware reported Depending on the type of error, the error cause is written to the HEC Firmware error register. On encountering such CSC firmware errors, the graphics device is non-recoverable from driver context. The only way to recover from these errors is firmware flash. System admin/userspace is notified of the necessity of firmware flash with a combination of vendor-specific drm device edged uevent, dmesg logs and runtime survivability sysfs. It is the responsiblity of the consumer to verify all the actions and then trigger a firmware flash using tools like fwupd. $ udevadm monitor --property --kernel monitor will print the received events for: KERNEL - the kernel uevent KERNEL[754.709341] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm) ACTION=change DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 SUBSYSTEM=drm WEDGED=vendor-specific DEVNAME=/dev/dri/card0 DEVTYPE=drm_minor SEQNUM=5973 MAJOR=226 MINOR=0 Logs xe 0000:03:00.0: [drm] ERROR [Hardware Error]: Tile0 reported NONFATAL error 0x20000 xe 0000:03:00.0: [drm] ERROR [Hardware Error]: NONFATAL: HEC Uncorrected FW FD Corruption error reported, bit[2] is set xe 0000:03:00.0: Runtime Survivability mode enabled xe 0000:03:00.0: [drm] ERROR CRITICAL: Xe has declared device 0000:03:00.0 as wedged. IOCTLs and executions are blocked. Only a rebind may clear the failure Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new xe 0000:03:00.0: [drm] device wedged, needs recovery xe 0000:03:00.0: Firmware flash required, Please refer to the userspace documentation for more details! Runtime survivability Sysfs: /sys/bus/pci/devices/<device>/survivability_mode v2: use vendor recovery method with runtime survivability (Christian, Rodrigo, Raag) v3: move declare wedged to runtime survivability mode (Rodrigo) v4: update commit message Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-10-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	0a2a873d61	drm/xe: Add support to handle hardware errors Gfx device reports two classes of errors: uncorrectable and correctable. Depending on the severity uncorrectable errors are further classified Non-Fatal and Fatal. Correctable and Non-Fatal errors: These errors are reported as MSI. Bits in the Master Interrupt Register indicate the class of the error. The source of the error is then read from the Device Error Source Register. Fatal errors: These are reported as PCIe errors When a PCIe error is asserted, the OS will perform a SBR (Secondary Bus reset) which causes the driver to reload. The error registers are sticky and the values are maintained through SBR. Add basic support to handle these errors. Bspec: 50875, 53073, 53074, 53075, 53076 v2: Format commit message (Umesh) v3: fix documentation (Stuart) Cc: Stuart Summers <stuart.summers@intel.com> Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-9-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	f646c9f937	drm/xe/doc: Document device wedged and runtime survivability Add documentation for vendor specific device wedged recovery method and runtime survivability. v2: fix documentation (Raag) v3: add userspace tool for firmware update (Raag) v4: use consistent documentation (Raag) v5: add more documentation Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-8-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	a2ca0633a0	drm/xe/xe_survivability: Add support for Runtime survivability mode Certain runtime firmware errors can cause the device to be in a unusable state requiring a firmware flash to restore normal operation. Runtime Survivability Mode indicates firmware flash is necessary by wedging the device and exposing survivability mode sysfs. The below sysfs is an indication that device is in survivability mode /sys/bus/pci/devices/<device>/survivability_mode v2: Fix kernel-doc (Umesh) v3: Add user friendly dmesg (Frank) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-7-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	41ff795aff	drm/xe/xe_survivability: Refactor survivability mode Refactor survivability mode code to support both boot and runtime survivability. Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-6-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	60439ac3f2	drm/xe: Add a helper function to set recovery method Add a helper function to set recovery method. The recovery method can be set before declaring the device wedged and sending the drm wedged uevent. If no method is set, default unbind/re-bind method will be set. v2: fix documentation (Raag) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-5-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	90fdcf5f89	drm/xe: Set GT as wedged before sending wedged uevent Userspace should be notified after setting the device as wedged. Re-order function calls to set gt wedged before sending uevent. Cc: Matthew Brost <matthew.brost@intel.com> Suggested-by: Raag Jadav <raag.jadav@intel.com> Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-4-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	9c857a9d84	drm: Add a vendor-specific recovery method to drm device wedged uevent Address the need for a recovery method (firmware flash on Firmware errors) introduced in the later patches of Xe KMD. Whenever XE KMD detects a firmware error, a firmware flash is required to recover the device to normal operation. The initial proposal to use 'firmware-flash' as a recovery method was not applicable to other drivers and could cause multiple recovery methods specific to vendors to be added. To address this a more generic 'vendor-specific' method is introduced, guiding users to refer to vendor specific documentation and system logs for detailed vendor specific recovery procedure. Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors must provide additional recovery documentation if this method is used. It is the responsibility of the consumer to refer to the correct vendor specific documentation and usecase before attempting a recovery. For example: If driver is XE KMD, the consumer must refer to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'. v2: fix documentation (Raag) v3: add more details to commit message (Sima, Rodrigo, Raag) add an example script to the documentation (Raag) v4: use consistent naming (Raag) v5: fix commit message v6: add more documentation Cc: André Almeida <andrealmeid@igalia.com> Cc: Christian König <christian.koenig@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Maxime Ripard <mripard@kernel.org> Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Maxime Ripard <mripard@kernel.org> Link: https://lore.kernel.org/r/20250826063419.3022216-3-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Riana Tauro	38fc73b8c7	drm/xe: Add documentation for Xe Device Wedging Add documentation for Xe Device Wedging so that file can be referenced in following patches. Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250826063419.3022216-2-riana.tauro@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-26 10:11:34 -04:00
Himal Prasad Ghimiray	418807860e	drm/xe/uapi: Add UAPI for querying VMA count and memory attributes Introduce the DRM_IOCTL_XE_VM_QUERY_MEMORY_RANGE_ATTRS ioctl to allow userspace to query memory attributes of VMAs within a user specified virtual address range. Userspace first calls the ioctl with num_mem_ranges = 0, sizeof_mem_ranges_attr = 0 and vector_of_vma_mem_attr = NULL to retrieve the number of memory ranges (vmas) and size of each memory range attribute. Then, it allocates a buffer of that size and calls the ioctl again to fill the buffer with memory range attributes. This two-step interface allows userspace to first query the required buffer size, then retrieve detailed attributes efficiently. v2 (Matthew Brost) - Use same ioctl to overload functionality v3 - Add kernel-doc v4 - Make uapi future proof by passing struct size (Matthew Brost) - make lock interruptible (Matthew Brost) - set reserved bits to zero (Matthew Brost) - s/__copy_to_user/copy_to_user (Matthew Brost) - Avod using VMA term in uapi (Thomas) - xe_vm_put(vm) is missing (Shuicheng) v5 - Nits - Fix kernel-doc Cc: Matthew Brost <matthew.brost@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250821173104.3030148-21-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-08-26 11:25:36 +05:30

1 2 3 4 5 ...

1382231 Commits