Commit Graph

1352048 Commits

Author SHA1 Message Date
Karthik Poosa
25e963a09e drm/xe/hwmon: Move card reactive critical power under channel card
Move power2/curr2_crit to channel 1 i.e power1/curr1_crit as this
represents the entire card critical power/current.

v2: Update the date of curr1_crit also in hwmon documentation.

Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Fixes: 345dadc4f6 ("drm/xe/hwmon: Add infra to support card power and energy attributes")
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://lore.kernel.org/r/20250529163458.2354509-3-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-05-30 11:30:01 -04:00
Karthik Poosa
7596d839f6 drm/xe/hwmon: Add support to manage power limits though mailbox
Add support to manage power limits using pcode mailbox commands
for supported platforms.

v2:
 - Address review comments. (Badal)
 - Use mailbox commands instead of registers to manage power limits
   for BMG.
 - Clamp the maximum power limit to GPU firmware default value.

v3:
 - Clamp power limit in write also for platforms with mailbox support.

v4:
 - Remove unnecessary debug prints. (Badal)

v5:
 - Update description of variable pl1_on_boot to fix kernel-doc error.

v6:
 - Improve commit message, refer to BIOS as GPU firmware.
 - Change macro READ_PL_FROM_BIOS to READ_PL_FROM_FW.
 - Rectify drm_warn to drm_info.

Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Fixes: e90f7a58e6 ("drm/xe/hwmon: Add HWMON support for BMG")
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://lore.kernel.org/r/20250529163458.2354509-2-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-05-30 11:30:01 -04:00
Matthew Brost
1a524e8b48 drm/xe: Do not warn on SVM migration failing because of 64k requirements
On platforms which only support 64k VRAM pages, it is expected that 4k
faults will not migrate. Do not warn on this, rather print a debug
message.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250529164338.1745515-1-matthew.brost@intel.com
2025-05-29 21:52:15 -07:00
Balasubramani Vivekanandan
241cc827c0 drm/xe/mocs: Initialize MOCS index early
MOCS uc_index is used even before it is initialized in the following
callstack
    guc_prepare_xfer()
    __xe_guc_upload()
    xe_guc_min_load_for_hwconfig()
    xe_uc_init_hwconfig()
    xe_gt_init_hwconfig()

Do MOCS index initialization earlier in the device probe.

Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Reviewed-by: Ravi Kumar Vodapalli <ravi.kumar.vodapalli@intel.com>
Link: https://lore.kernel.org/r/20250520142445.2792824-1-balasubramani.vivekanandan@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-05-29 14:29:18 -07:00
Niranjana Vishwanathapura
fbeaad071a drm/xe: Create LRC BO without VM
Specifying VM during lrc->bo creation requires VM's reference
to be held for the lifetime of lrc->bo as it will use VM's dma
reservation object. Using VM's dma reservation object for
lrc->bo doesn't provide any advantage. Hence do not pass VM
while creating lrc->bo.

v2: Use xe_bo_unpin_map_no_vm (Matthew Brost)

Fixes: 264eecdba2 ("drm/xe: Decouple xe_exec_queue and xe_lrc")
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250529052031.2429120-2-niranjana.vishwanathapura@intel.com
2025-05-29 09:18:31 -07:00
Matthew Auld
4f296d77cf drm/xe/vm: move xe_svm_init() earlier
In xe_vm_close_and_put() we need to be able to call xe_svm_fini(),
however during vm creation we can call this on the error path, before
having actually initialised the svm state, leading to various splats
followed by a fatal NPD.

Fixes: 6fd979c2f3 ("drm/xe: Add SVM init / close / fini to faulting VMs")
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4967
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250514152424.149591-4-matthew.auld@intel.com
2025-05-29 11:56:03 +01:00
Matthew Auld
96af397aa1 drm/xe/vm: move rebind_work init earlier
In xe_vm_close_and_put() we need to be able to call
flush_work(rebind_work), however during vm creation we can call this on
the error path, before having actually set up the worker, leading to a
splat from flush_work().

It looks like we can simply move the worker init step earlier to fix
this.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250514152424.149591-3-matthew.auld@intel.com
2025-05-29 11:56:01 +01:00
Matthew Auld
338ec84dee drm/xe/bo: optimise CCS case for WB pages
Dealing with CCS state is significant on LNL+, where we end up clearing
the compression state on every page alloc using the blitter for user
buffers, including also saving and restoring it when moving between
domains, plus we need to alloc extra pages to hold the raw CCS state for
the save step.

However all compression PAT modes, on platforms like LNL, also require
coh_none, meaning that only WC memory can use compression in the first
place. With this we can be sneaky and completely ignore CCS for WB
buffers, which is likely the common case anyway. This would then skip
all blitter moves/clears between sys <-> tt and then also means we can
drop the extra CCS pages.

This should be safe since there is no way to interact with the
compression state (potentially uncleared) without using a PAT enabled
index (which is rejected at bind), including if trying to be malicious
and copy the raw CCS state from userpace, which should give back all
zeroes if the src surface (indirect) is lacking compressed PAT index.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://lore.kernel.org/r/20250516153810.223530-2-matthew.auld@intel.com
2025-05-29 11:54:45 +01:00
Michal Wajdeczko
2cb38bb0ad drm/xe: Allow to trigger GT resets using debugfs writes
Today we allow to trigger GT resest by reading dedicated debugfs
files "force_reset" and "force_reset_sync" that we are exposing
using drm_info_list[] and drm_debugfs_create_files().

To avoid triggering potentially disruptive actions during otherwise
"safe" read operations, expose those two attributes using debugfs
function where we can specify file permissions and provide custom
"write" handler to trigger the GT resets also from there.

This step would allow us to drop triggering GT resets during read
operations, which we leave just to give users more time to switch.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250519200914.216-1-michal.wajdeczko@intel.com
2025-05-28 20:13:18 +02:00
Himal Prasad Ghimiray
22eba3be8e drm/xe/svm: Avoid duplicate eviction on get_pages() failure
xe_svm_range_get_pages() already calls drm_gpusvm_range_evict()
internally when it fails with -EOPNOTSUPP. Remove the eviction
call in the caller to prevent duplicate handling.

Fixes: e0ff0d7cf9 ("drm/xe/svm: Refactor usage of drm_gpusvm* function in xe_svm")
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250526163907.1011529-1-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-28 15:44:34 +05:30
Rodrigo Vivi
39578fa404 drm/xe: Add missing documentation of rpa_freq
While at it, already adjust the rpe_freq frequency, to highlight
that both are calculated by PCODE at runtime.

Fixes: c6aac2fa77 ("drm/xe: Introduce the RPa information")
Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://lore.kernel.org/r/20250521165146.39616-4-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-05-27 15:25:06 -04:00
Rodrigo Vivi
af53f0fd99 drm/xe: Make xe_gt_freq part of the Documentation
The documentation was created with the creation of the component,
however it has never been actually shown in the actual Documentation.

While doing this, fixes the identation style, to avoid new warnings
while building htmldocs.

Fixes: bef52b5c7a ("drm/xe: Create a xe_gt_freq component for raw management and sysfs")
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250521165146.39616-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-05-27 15:25:05 -04:00
Tomasz Lis
20a07782da drm/xe/vf: Fail migration recovery if fixups needed but platform not supported
The post-migration recovery needs to be fully implemented for a
specific platform in order to make continuation of workloads
possible.

New platforms introduce changes which affect the recovery procedure,
and without a clear verification of support this leads to errors
with no straight forward error message explaining the cause.

This patch fixes that issue - it introduces a message to be logged
when the current driver is known to not support the current platform.

Wedging the driver immediately also decreases the amount of
additional errors which would come afterwards if the driver continued
operation.

v2: Show the message during probe as well as during recovery; do not
  perform any recovery steps if the recovery is bound to fail
v3: Use SRIOV-specific logging, fix typos
v4: XE_DEBUG_SRIOV to XE_DEBUG check switch, to make testing more
  straightforward

Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Acked-by: Michał Winiarski <michal.winiarski@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250519230035.3143966-1-tomasz.lis@intel.com
2025-05-22 12:04:09 +02:00
Matt Atwood
49c6dc74b5 drm/xe/ptl: Update the PTL pci id table
Update to current bspec table.

Bspec: 72574

Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Clint Taylor <Clinton.A.Taylor@intel.com>
Link: https://lore.kernel.org/r/20250520195749.371748-1-matthew.s.atwood@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-05-21 13:48:25 -07:00
Shuicheng Lin
d2662cf8f4 drm/xe: Use xe_mmio_read32() to read mtcfg register
The mtcfg register is a 32-bit register and should therefore be
accessed using xe_mmio_read32().

Other 3 changes per codestyle suggestion:
"
xe_mmio.c:83: CHECK: Alignment should match open parenthesis
xe_mmio.c:131: CHECK: Comparison to NULL could be written "!xe->mmio.regs"
xe_mmio.c:315: CHECK: line length of 103 exceeds 100 columns
"

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20250513153010.3464767-1-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-05-19 12:18:56 -07:00
Aradhya Bhatia
a7f87deac2 drm/xe: Default auto_link_downgrade status to false
xe_pcode_read() can return back successfully without updating the
variable 'val'. This can cause an arbitrary value to show up in the
sysfs file.

Allow the auto_link_downgrade_status to default to 0 to avoid any
arbitrary value from coming up.

Fixes: 0e414bf7ad ("drm/xe: Expose PCIe link downgrade attributes")
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com>
Link: https://lore.kernel.org/r/20250516124355.4872-1-aradhya.bhatia@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-05-19 09:33:52 -07:00
Aradhya Bhatia
17486cf3df drm/xe/guc: Make creation of SLPC debugfs files conditional
Platforms that do not support SLPC are exempted from the GuC PC support.
The GuC PC does not get initialized, and neither do its BOs get created.

This causes a problem because the GuC PC debugfs file is still being
created. Whenever the file is attempted to read, it causes a NULL
pointer dereference on the supposed BO of the GuC PC.

So, make the creation of SLPC debugfs files conditional to when SLPC
features are supported.

Fixes: aaab5404b1 ("drm/xe: Introduce GuC PC debugfs")
Suggested-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com>
Link: https://lore.kernel.org/r/20250516141902.5614-1-aradhya.bhatia@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-05-19 09:00:05 -07:00
Tejas Upadhyay
a383cf218e drm/xe/mocs: Check if all domains awake
Check if all domains are awake specially for
LNCF regs

Fixes: 1182bc74b3 ("drm/xe: Fix MOCS debugfs LNCF readout")
Improvements-suggested-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250506142300.1865783-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
2025-05-16 16:51:50 +05:30
Piotr Piórkowski
921ddb37d8 drm/xe/pf: Don't allow LMEM provisioning if LMTT isn't available on the device
The LMEM provisioning is applicable only on platforms with LMTT.

v2:
 - new commit description
 - use xe_gt_assert in xe_gt_sriov_pf_config_set_lmem instead return
   error,
 - disable pf_lmem_info if LMTT is not available
v3: fix condition in xe_gt_assert
v4: rebase

Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250513071321.700464-1-piotr.piorkowski@intel.com
2025-05-16 13:11:01 +02:00
John Harrison
16b7e65d29 drm/xe/guc: Track FAST_REQ H2Gs to report where errors came from
Most H2G messages are FAST_REQ which means no synchronous response is
expected. The messages are sent as fire-and-forget with no tracking.
However, errors can still be returned when something goes unexpectedly
wrong. That leads to confusion due to not being able to match up the
error response to the originating H2G.

So add support for tracking the FAST_REQ H2Gs and matching up an error
response to its originator. This is only enabled in XE_DEBUG builds
given that such errors should never happen in a working system and
there is an overhead for the tracking.

Further, if XE_DEBUG_GUC is enabled then even more memory and time is
used to record the call stack of each H2G and report that with the
error. That makes it much easier to work out where a specific H2G came
from if there are multiple code paths that can send it.

v2: Some re-wording of comments and prints, more consistent use of #if
vs stub functions - review feedback from Daniele & Michal).
v3: Split config change to separate patch, improve a debug print
(review feedback from Michal).
v4: Bunch of minor tweaks (review feedback from Michal).

Original-i915-code: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250512215324.1457009-5-John.C.Harrison@Intel.com
2025-05-15 12:27:37 -07:00
John Harrison
d7d97890e2 drm/xe/guc: Rename CONFIG_XE_LARGE_GUC_BUFFER
Rename XE_LARGE_GUC_BUFFER to XE_DEBUG_GUC to allow for more debug
only code (in subsequent patch) without adding more config defines
that each control only a single thing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250512215324.1457009-4-John.C.Harrison@Intel.com
2025-05-15 12:27:36 -07:00
John Harrison
12373b30e2 drm/xe/guc: Add missing H2G error code definitions
These error codes are not actually used in the driver but it is
extremely useful to have them available to understand error messages.

v2: Add a bunch more error codes and drop 'status' from names (review
feedback by Michal W).
v3: Drop 'SUCCESS' response as meaningless in current API (review
feedback by Michal W).

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250512215324.1457009-3-John.C.Harrison@Intel.com
2025-05-15 12:27:34 -07:00
John Harrison
fddf8cdd4b drm/xe/guc: Remove double blank line
An earlier patch moved a drm_print a few lines lower but accidentally
left a double blank line behind. So fix that.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20250512215324.1457009-2-John.C.Harrison@Intel.com
2025-05-15 12:27:33 -07:00
Lucas De Marchi
eaa287069a drm/xe/guc_submit: Simplify and fix diff calculation
With a u32 type, there's no need to check which one is greater: the
current is always the latest and if it's less than the previous, it's
because it wrapped: just do the unsigned calculation that will lead to
the same result, or better the correct one. It fixes an off-by-one in
the wrapped calculation, however that doesn't really matter for the
timeout calculation.

Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250513-time-wrap-v1-1-fba9a69a65c8@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-05-15 06:14:17 -07:00
Michal Wajdeczko
3dbab383e3 drm/xe/guc: Don't allocate managed BO for each policy change
We shouldn't use xe_managed_bo_create_from_data() to allocate
temporary BO, as it will be released only on unload and every
change in wedge_mode policy will consume resources (including
precious GGTT). Instead just switchover to GuC buffer cache.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250512220018.172-3-michal.wajdeczko@intel.com
2025-05-15 12:29:55 +02:00
Michal Wajdeczko
b86babc9d9 drm/xe/guc: Unblock GuC buffer cache for all modes
Today we were using GuC buffer cache only in the PF mode, but
shortly we will want to use it also in native and VF mode.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://lore.kernel.org/r/20250512220018.172-2-michal.wajdeczko@intel.com
2025-05-15 12:29:54 +02:00
Himal Prasad Ghimiray
5aee6e33e1 drm/xe/vm: Add debug prints for SVM range prefetch
Introduce debug logs for the prefetch operation of SVM ranges.

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-16-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
09ba0a8f06 drm/xe/svm: Implement prefetch support for SVM ranges
This commit adds prefetch support for SVM ranges, utilizing the
existing ioctl vm_bind functionality to achieve this.

v2: rebase

v3:
   - use xa_for_each() instead of manual loop
   - check range is valid and in preferred location before adding to
     xarray
   - Fix naming conventions
   - Fix return condition as -ENODATA instead of -EAGAIN (Matthew Brost)
   - Handle sparsely populated cpu vma range (Matthew Brost)

v4:
   - fix end address to find next cpu vma in case of -ENOENT

v5:
   - Move find next vma logic to drm gpusvm layer
   - Avoid mixing declaration and logic

v6:
  - Use new function names
  - Move eviction logic to prefetch_ranges

v7:
  - devmem_only assigned 0
  - nit address

v8:
  - initialize ctx with 0

Cc: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-15-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
c904d4e2d7 drm/xe/svm: Add xe_svm_find_vma_start() helper
Add helper xe_svm_find_vma_start() function to determine start of cpu
vma in input range.

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-14-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
72fa870957 drm/gpusvm: Introduce drm_gpusvm_find_vma_start() function
The drm_gpusvm_find_vma_start() function is used to determine the starting
address of a CPU VMA within a specified user range. If the range does not
contain any VMA, the function returns ULONG_MAX.

v2
- Rename function as drm_gpusvm_find_vma_start() (Matthew Brost)
- mmget/mmput

v3
- s/mmget/mmget_not_zero/

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-13-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
6275362f18 drm/xe/svm: Add xe_svm_range_validate() and xe_svm_range_migrate_to_smem()
The xe_svm_range_validate() function checks if a range is
valid and located in the desired memory region.

xe_svm_range_migrate_to_smem() checks if range have pages in devmem and
migrate them to smem.

v2
- Fix function stub in xe_svm.h
- Fix doc

v3 (Matthew Brost)
- Remove extra new line
- s/range->base.flags.has_devmem_pages/xe_svm_range_in_vram

v4 (Matthew Brost)
- s/xe_svm_range_in_vram/range->base.flags.has_devmem_pages
- Move eviction logic to separate function

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-12-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
cc795e0410 drm/xe/svm: Make xe_svm_range_needs_migrate_to_vram() public
xe_svm_range_needs_migrate_to_vram() determines whether range needs
migration to vram or not, modify it to accept region preference parameter
too, so we can use it in prefetch too.

v2
- add assert instead of warn (Matthew Brost)

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-11-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
e0ff0d7cf9 drm/xe/svm: Refactor usage of drm_gpusvm* function in xe_svm
Define xe_svm_range_find_or_insert function wrapping
drm_gpusvm_range_find_or_insert for reusing in prefetch.

Define xe_svm_range_get_pages function wrapping
drm_gpusvm_range_get_pages for reusing in prefetch.

-v2 pass pagefault defined drm_gpu_svm context as parameter
in xe_svm_range_find_or_insert(Matthew Brost)

Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-10-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
da05e5ddc6 drm/xe: Rename lookup_vma function to xe_find_vma_by_addr
This update renames the lookup_vma function to xe_vm_find_vma_by_addr and
makes it accessible externally. The function, which looks up a VMA by
its address within a specified VM, will be utilized in upcoming patches.

v2
 - Fix doc

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-9-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
bd1d1b46fe drm/xe/vm: Add an identifier in xe_vma_ops for svm prefetch
Add a flag in xe_vma_ops to determine whether it has svm prefetch ops or
not.

v2:
 - s/false/0 (Matthew Brost)

v3:
 - s/XE_VMA_OPS_HAS_SVM_PREFETCH/XE_VMA_OPS_FLAG_HAS_SVM_PREFETCH

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-8-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
34ebb62723 drm/xe/vm: Update xe_vma_ops_incr_pt_update_ops to take an increment value
Prefetch for SVM ranges can have more than one operation to increment,
hence modify the function to accept an increment value as input.

v2:
  - Call xe_vma_ops_incr_pt_update_ops only once for REMAP (Matthew Brost)
  - Add check for 0 ops

v3:
  - s/u8/int for inc_val and num_remap_ops (Matthew Brost)

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-7-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray
da2eb41004 drm/xe/svm: Make xe_svm_range_* end/start/size public
These functions will be used in prefetch too, therefore make them public.

v2
  - Fix kernel doc

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-6-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray
18211ff4d5 drm/xe/svm: Make to_xe_range a public function
The to_xe_range function will be used in other files. Therefore, make it
public and add kernel-doc documentation

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-5-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray
eb07c2fc10 drm/xe/svm: Helper to add tile masks to svm ranges
Introduce a helper to add tile mask of binding present and invalidated
for the range. Add a lockdep_assert to ensure it is protected by GPU SVM
notifier lock.

-v7
rebased

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-4-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray
686a526dad drm/xe: Make xe_svm_alloc_vram public
This function will be used in prefetch too, hence make it public.

v2:
  - Add kernel-doc (Matthew Brost)
  - Rebase

v3:
 - Move CONFIG_DRM_XE_DEVMEM_MIRROR stub out to xe_svm.c (Matthew Brost)

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-3-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray
745df157e4 drm/xe: Introduce xe_vma_op_prefetch_range struct for prefetch of ranges
Add xe_vma_op_prefetch_range struct for svm ranges prefetching, including
an xarray of SVM range pointers, range count, and target memory region.

-v2: Fix doc

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250513040228.470682-2-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-05-14 19:25:53 +05:30
Umesh Nerlige Ramappa
82b98cadb0 drm/xe: Add WA BB to capture active context utilization
Context Timestamp (CTX_TIMESTAMP) in the LRC accumulates the run ticks
of the context, but only gets updated when the context switches out. In
order to check how long a context has been active before it switches
out, two things are required:

(1) Determine if the context is running:

To do so, we program the WA BB to set an initial value for CTX_TIMESTAMP
in the LRC. The value chosen is 1 since 0 is the initial value when the
LRC is initialized. During a query, we just check for this value to
determine if the context is active. If the context switched out, it
would overwrite this location with the actual CTX_TIMESTAMP MMIO value.
Note that WA BB runs as the last part of the context restore, so reusing
this LRC location will not clobber anything.

(2) Calculate the time that the context has been active for:

The CTX_TIMESTAMP ticks only when the context is active. If a context is
active, we just use the CTX_TIMESTAMP MMIO as the new value of
utilization. While doing so, we need to read the CTX_TIMESTAMP MMIO
for the specific engine instance. Since we do not know which instance
the context is running on until it is scheduled, we also read the
ENGINE_ID MMIO in the WA BB and store it in the PPHSWP.

Using the above 2 instructions in a WA BB, capture active context
utilization.

v2: (Matt Brost)
- This breaks TDR, fix it by saving the CTX_TIMESTAMP register
  "drm/xe: Save CTX_TIMESTAMP mmio value instead of LRC value"
- Drop tile from LRC if using gt
  "drm/xe: Save the gt pointer in LRC and drop the tile"

v3:
- Remove helpers for bb_per_ctx_ptr (Matt)
- Add define for context active value (Matt)
- Use 64 bit CTX TIMESTAMP for platforms that support it. For platforms
  that don't, live with the rare race. (Matt, Lucas)
- Convert engine id to hwe and get the MMIO value (Lucas)
- Correct commit message on when WA BB runs (Lucas)

v4:
- s/GRAPHICS_VER(...)/xe->info.has_64bit_timestamp/ (Matt)
- Drop support for active utilization on a VF (CI failure)
- In xe_lrc_init ensure the lrc value is 0 to begin with (CI regression)

v5:
- Minor checkpatch fix
- Squash into previous commit and make TDR use 32-bit time
- Update code comment to match commit msg

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4532
Suggested-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250509161159.2173069-8-umesh.nerlige.ramappa@intel.com
2025-05-12 14:33:25 -07:00
Umesh Nerlige Ramappa
741d3ef8b8 drm/xe: Save the gt pointer in lrc and drop the tile
Save the gt pointer in the lrc so that it can used for gt based helpers.

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250509161159.2173069-7-umesh.nerlige.ramappa@intel.com
2025-05-12 14:33:24 -07:00
Umesh Nerlige Ramappa
38b14233e5 drm/xe: Save CTX_TIMESTAMP mmio value instead of LRC value
For determining actual job execution time, save the current value of the
CTX_TIMESTAMP register rather than the value saved in LRC since the
current register value is the closest to the start time of the job.

v2: Define MI_STORE_REGISTER_MEM to fix compile error
v3: Place MI_STORE_REGISTER_MEM sorted by MI_INSTR (Lucas)

Fixes: 65921374c4 ("drm/xe: Emit ctx timestamp copy in ring ops")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250509161159.2173069-6-umesh.nerlige.ramappa@intel.com
2025-05-12 14:33:23 -07:00
Matthew Brost
1b894c2246 drm/xe: Add atomic_svm_timeslice_ms debugfs entry
Add some informal control for atomic SVM fault GPU timeslice to be able
to play around with values and tweak performance.

v2:
 - Reduce timeslice default value to 5ms

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20250512135500.1405019-6-matthew.brost@intel.com
2025-05-12 13:49:18 -07:00
Matthew Brost
a5d8d3be1d drm/xe: Timeslice GPU on atomic SVM fault
Ensure GPU can make forward progress on an atomic SVM GPU fault by
giving the GPU a timeslice of 5ms

v2:
 - Reduce timeslice to 5ms
 - Double timeslice on retry
 - Split out GPU SVM changes into independent patch
v5:
 - Double timeslice in a few more places

Fixes: 2f118c9491 ("drm/xe: Add SVM VRAM migration")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20250512135500.1405019-5-matthew.brost@intel.com
2025-05-12 13:49:18 -07:00
Matthew Brost
8dc1812b5b drm/gpusvm: Add timeslicing support to GPU SVM
Add timeslicing support to GPU SVM which will guarantee the GPU a
minimum execution time on piece of physical memory before migration back
to CPU. Intended to implement strict migration policies which require
memory to be in a certain placement for correct execution.

Required for shared CPU and GPU atomics on certain devices.

Fixes: 99624bdff8 ("drm/gpusvm: Add support for GPU Shared Virtual Memory")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20250512135500.1405019-4-matthew.brost@intel.com
2025-05-12 13:49:18 -07:00
Matthew Brost
a9ac0fa455 drm/xe: Strict migration policy for atomic SVM faults
Mixing GPU and CPU atomics does not work unless a strict migration
policy of GPU atomics must be device memory. Enforce a policy of must be
in VRAM with a retry loop of 3 attempts, if retry loop fails abort
fault.

Removing always_migrate_to_vram modparam as we now have real migration
policy.

v2:
 - Only retry migration on atomics
 - Drop alway migrate modparam
v3:
 - Only set vram_only on DGFX (Himal)
 - Bail on get_pages failure if vram_only and retry count exceeded (Himal)
 - s/vram_only/devmem_only
 - Update xe_svm_range_is_valid to accept devmem_only argument
v4:
 - Fix logic bug get_pages failure
v5:
 - Fix commit message (Himal)
 - Mention removing always_migrate_to_vram in commit message (Lucas)
 - Fix xe_svm_range_is_valid to check for devmem pages
 - Bail on devmem_only && !migrate_devmem (Thomas)
v6:
 - Add READ_ONCE barriers for opportunistic checks (Thomas)
 - Pair READ_ONCE with WRITE_ONCE (Thomas)
v7:
 - Adjust comments (Thomas)

Fixes: 2f118c9491 ("drm/xe: Add SVM VRAM migration")
Cc: stable@vger.kernel.org
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/20250512135500.1405019-3-matthew.brost@intel.com
2025-05-12 13:49:08 -07:00
Himal Prasad Ghimiray
8a9b978ebd drm/gpusvm: Introduce devmem_only flag for allocation
This commit adds a new flag, devmem_only, to the drm_gpusvm structure. The
purpose of this flag is to ensure that the get_pages function allocates
memory exclusively from the device's memory. If the allocation from
device memory fails, the function will return an -EFAULT error.

Required for shared CPU and GPU atomics on certain devices.

v3:
 - s/vram_only/devmem_only/

Fixes: 99624bdff8 ("drm/gpusvm: Add support for GPU Shared Virtual Memory")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250512135500.1405019-2-matthew.brost@intel.com
2025-05-12 13:48:48 -07:00
Aradhya Bhatia
e5c13e2c50 drm/xe/xe2hpg: Add Wa_22021007897
Add Wa_22021007897 for the Xe2_HPG (graphics version: 20.01) IP. It is
a permanent workaround, and applicable on all the steppings.

Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com>
Link: https://lore.kernel.org/r/20250512065004.2576-1-aradhya.bhatia@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-05-12 13:23:47 -07:00