Commit Graph

1384740 Commits

Author SHA1 Message Date
Matthew Brost
807c42dd80 drm/xe: Don't change LRC ring head on job resubmission
Now that we save the job's head during submission, it's no longer
necessary to adjust the LRC ring head during resubmission. Instead, a
software-based adjustment of the tail will overwrite the old jobs in
place. For some odd reason, adjusting the LRC ring head didn't work on
parallel queues, which was causing issues in our CI.
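
A minimal userspace sketch of the replay idea. It assumes a byte ring whose size is a power of two, and jobs that remember the offset at which they were emitted. The names `struct ring`, `struct job`, `ring_emit()` and `ring_replay()` are hypothetical and do not come from the Xe driver. The sketch shows only one thing: the software tail is rewound to the first pending job's saved head and the jobs are emitted again in place, while the hardware head is never touched.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64u /* bytes; power of two so offsets wrap with a mask */

struct ring {
	uint8_t buf[RING_SIZE];
	uint32_t tail; /* software tail, in bytes */
};

struct job {
	uint8_t cmds[8]; /* commands this job emits */
	uint32_t len;    /* number of valid bytes in cmds */
	uint32_t head;   /* ring offset saved when the job was emitted */
};

/* Copy a job's commands into the ring, remembering where they start. */
void ring_emit(struct ring *r, struct job *j)
{
	assert(j->len <= sizeof(j->cmds));
	j->head = r->tail;
	for (uint32_t i = 0; i < j->len; i++)
		r->buf[(r->tail + i) & (RING_SIZE - 1)] = j->cmds[i];
	r->tail = (r->tail + j->len) & (RING_SIZE - 1);
}

/*
 * Resubmit pending jobs without moving the hardware head: rewind the
 * software tail to the first pending job's saved head and re-emit, so
 * the old ring contents are overwritten in place.
 */
void ring_replay(struct ring *r, struct job *pending, unsigned int count)
{
	if (!count)
		return;
	r->tail = pending[0].head;
	for (unsigned int i = 0; i < count; i++)
		ring_emit(r, &pending[i]);
}
```

Because each job's position is recorded when it is emitted, a replay can also patch a job's contents (for example, updated addresses) before rewriting it.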

v5:
 - Add comment in guc_exec_queue_start explaining why the function works
   (Auld)
v7:
 - Only adjust first state on first unsignaled job (Auld)
v8:
 - Break unsignaled job handling to separate patch (Auld)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-7-matthew.brost@intel.com
2025-10-09 03:22:21 -07:00
Matthew Brost
b00d1e3fc8 drm/xe: Return first unsignaled job first pending job helper
In all cases where the first pending job helper is called, we only want
to retrieve the first unsignaled pending job, as this helper is used
exclusively in recovery flows. It is possible for signaled jobs to
remain in the pending list as the scheduler is stopped, so those should
be skipped.

Also, add kernel documentation to clarify this behavior.
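
The helper's behavior can be sketched as a plain list walk. This is a simplified model, not the Xe code. `struct pending_job` and `first_unsignaled_job()` are hypothetical, and a bool stands in for checking the job's fence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pending_job {
	bool signaled;            /* has this job's fence already signaled? */
	struct pending_job *next; /* next job on the pending list */
};

/*
 * Return the first job on the pending list whose fence has not
 * signaled, or NULL if every pending job has already completed.
 * Signaled jobs can linger on the list while the scheduler is
 * stopped, so they are skipped.
 */
struct pending_job *first_unsignaled_job(struct pending_job *head)
{
	for (struct pending_job *j = head; j; j = j->next)
		if (!j->signaled)
			return j;
	return NULL;
}
```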

v8:
 - Split out into own patch (Auld)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-6-matthew.brost@intel.com
2025-10-09 03:22:20 -07:00
Matthew Brost
f6375fb3aa drm/xe: Track LR jobs in DRM scheduler pending list
VF migration requires jobs to remain pending so they can be replayed
after the VF comes back. Previously, LR job fences were intentionally
signaled immediately after submission to avoid the risk of exporting
them, as these fences do not naturally signal in a timely manner and
could break dma-fence contracts. A side effect of this approach was that
LR jobs were never added to the DRM scheduler’s pending list, preventing
them from being tracked for later resubmission.

We now avoid signaling LR job fences and ensure they are never exported;
Xe already guards against exporting these internal fences. With that
guarantee in place, we can safely track LR jobs in the scheduler’s
pending list so they are eligible for resubmission during VF
post-migration recovery (and similar recovery paths).

An added benefit is that LR queues now gain the DRM scheduler’s built-in
flow control over ring usage rather than rejecting new jobs in the exec
IOCTL if the ring is full.

v2:
 - Ensure DRM scheduler TDR doesn't run for LR jobs
 - Stack variable for killed_or_banned_or_wedged
v4:
 - Clarify commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-5-matthew.brost@intel.com
2025-10-09 03:22:19 -07:00
Matthew Brost
7e1fe102c8 drm/xe/guc: Track pending-enable source in submission state
Add explicit tracking in the GuC submission state to record the source
of a pending enable (TDR vs. queue resume path vs. submission).
Disambiguating the origin lets the GuC submission state machine apply
the correct recovery/replay behavior.

This helps VF restore: when the device comes back, the state machine knows
whether the pending enable stems from timeout recovery, from a queue resume
sequence, or from submission, and can gate sequencing and fixups accordingly.
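
One way to picture the tracking is an enum that is recorded alongside the pending enable and consulted after restore. Everything below is hypothetical. The enum values mirror the three sources named above, but the mapping to recovery actions is an illustrative assumption, not the GuC submission code's actual policy.

```c
#include <assert.h>

/* Hypothetical record of who requested a still-pending context enable. */
enum pending_enable_source {
	PENDING_ENABLE_NONE,
	PENDING_ENABLE_SUBMIT, /* regular submission path */
	PENDING_ENABLE_RESUME, /* queue resume sequence */
	PENDING_ENABLE_TDR,    /* timeout (TDR) recovery */
};

/* Hypothetical fixups the state machine could choose between. */
enum replay_action {
	REPLAY_NOTHING,
	REPLAY_REISSUE_ENABLE,
	REPLAY_RESUME_SEQUENCE,
	REPLAY_RESTART_TDR,
};

/* After VF restore, pick a fixup based on where the enable came from. */
enum replay_action replay_for(enum pending_enable_source src)
{
	switch (src) {
	case PENDING_ENABLE_SUBMIT:
		return REPLAY_REISSUE_ENABLE;
	case PENDING_ENABLE_RESUME:
		return REPLAY_RESUME_SEQUENCE;
	case PENDING_ENABLE_TDR:
		return REPLAY_RESTART_TDR;
	default:
		return REPLAY_NOTHING;
	}
}
```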

v4:
 - Clarify commit message (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-4-matthew.brost@intel.com
2025-10-09 03:22:18 -07:00
Matthew Brost
26cd498e00 drm/xe: Save off position in ring in which a job was programmed
VF post-migration recovery needs to modify the ring with updated GGTT
addresses for pending jobs. Save off the position in the ring at which
each job was programmed to facilitate this.

v4:
 - s/VF resume/VF post-migration recovery (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-3-matthew.brost@intel.com
2025-10-09 03:22:16 -07:00
Matthew Brost
b0607599b7 drm/xe: Add NULL checks to scratch LRC allocation
kmalloc can fail, so the returned value must be checked for NULL. The
check should come immediately after the kmalloc call for clarity.
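
The same pattern in userspace terms, using `malloc()` as a stand-in for `kmalloc()` and a hypothetical `struct scratch_state`. The NULL check sits directly after the allocation, so no later code can touch an unallocated buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct scratch_state {
	void *buffer;
	size_t size;
};

/* Allocate the scratch buffer and check for failure right away. */
int scratch_state_init(struct scratch_state *state, size_t size)
{
	state->buffer = malloc(size);
	if (!state->buffer)
		return -ENOMEM; /* bail out before anything uses the buffer */
	state->size = size;
	return 0;
}
```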

v5:
 - Assert state->buffer in setup_bo if buffer is iomem (Tomasz)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-2-matthew.brost@intel.com
2025-10-09 03:22:15 -07:00
Tejas Upadhyay
15b3036045 drm/xe: Move declarations under conditional branch
The xe_device_shutdown() function had a few declarations
that were only required under a specific condition. This change
moves those declarations into that conditional branch
to avoid unnecessary declarations.

Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20251007100208.1407021-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
2025-10-08 15:07:41 +05:30
Michal Wajdeczko
c09a9933af drm/xe/pf: Add max_vfs configfs attribute to control PF mode
In addition to the existing max_vfs modparam, add a max_vfs configfs
attribute to allow PF configuration on a per-device level.
The default config value is still based on the modparam value.
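
A sketch of the defaulting behavior. It assumes a per-device config object that is seeded from the module parameter when it is created and can then be overridden individually. The names and the value 7 are hypothetical, and no real configfs plumbing is shown.

```c
#include <assert.h>

/* Hypothetical stand-in for the driver-wide max_vfs module parameter. */
unsigned int modparam_max_vfs = 7;

struct device_config {
	unsigned int max_vfs; /* per-device value, writable via configfs */
};

/* A new per-device config starts from the module parameter value. */
void device_config_init(struct device_config *cfg)
{
	cfg->max_vfs = modparam_max_vfs;
}

/* A configfs store then overrides the default for this device only. */
void device_config_store_max_vfs(struct device_config *cfg, unsigned int val)
{
	cfg->max_vfs = val;
}
```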

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251002232648.203370-1-michal.wajdeczko@intel.com
2025-10-07 23:03:44 +02:00
Michal Wajdeczko
4592e7abd2 drm/xe/pf: Improve reading VF config blob from debugfs
Due to the way the file operations flow, we might encode the
VF config blob multiple times. Avoid that by capturing it once
during the open() operation instead of during the read() operation.
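
A userspace model of the change, assuming a hypothetical `encode_vf_config()` that is expensive to run. The blob is encoded once in `blob_open()`. `blob_read()` only copies from that snapshot while tracking the file position, which is roughly the bookkeeping the kernel's `simple_read_from_buffer()` performs. All other names are illustrative.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Counts how many times the (expensive) blob encoding runs. */
int encode_calls;

/* Hypothetical encoder standing in for the VF config blob encoding. */
static size_t encode_vf_config(char *buf, size_t cap)
{
	const char blob[] = "KLV-BLOB";
	size_t n = sizeof(blob) - 1 < cap ? sizeof(blob) - 1 : cap;

	encode_calls++;
	memcpy(buf, blob, n);
	return n;
}

struct blob_file {
	char data[64];
	size_t size;
};

/* open(): encode once and keep the snapshot with the open file. */
void blob_open(struct blob_file *f)
{
	f->size = encode_vf_config(f->data, sizeof(f->data));
}

/* read(): copy from the snapshot only, honoring the file position. */
ssize_t blob_read(struct blob_file *f, char *ubuf, size_t count, size_t *pos)
{
	if (*pos >= f->size)
		return 0;
	if (count > f->size - *pos)
		count = f->size - *pos;
	memcpy(ubuf, f->data + *pos, count);
	*pos += count;
	return (ssize_t)count;
}
```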

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251004162036.1800-1-michal.wajdeczko@intel.com
2025-10-07 13:00:29 +02:00
Michal Wajdeczko
0faa22e706 drm/xe/guc: Ratelimit diagnostic messages from the relay
Malicious VFs might flood the PF's dmesg with our diagnostic
messages by sending invalid VF2PF relay messages.

Rate-limit all relay messages, unless running in DEBUG_SRIOV mode.
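
A minimal interval/burst limiter shows the intended effect. It is loosely modeled on the kernel's `ratelimit_state`, but it is a simplified userspace sketch. The field names, the tick-based clock and the `debug_sriov` bypass flag are assumptions standing in for the real DEBUG_SRIOV configuration.

```c
#include <assert.h>
#include <stdbool.h>

struct ratelimit {
	unsigned long interval; /* window length, in ticks */
	unsigned int burst;     /* messages allowed per window */
	unsigned long begin;    /* tick at which the current window started */
	unsigned int printed;   /* messages let through in this window */
	unsigned int missed;    /* messages suppressed in this window */
};

/* Return true if a message at tick @now may be printed. */
bool ratelimit_ok(struct ratelimit *rs, unsigned long now)
{
	if (now - rs->begin >= rs->interval) {
		rs->begin = now; /* start a new window */
		rs->printed = 0;
		rs->missed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}

/* Debug builds log everything; otherwise go through the limiter. */
bool relay_should_log(struct ratelimit *rs, unsigned long now, bool debug_sriov)
{
	return debug_sriov || ratelimit_ok(rs, now);
}
```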

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251005173946.2784-1-michal.wajdeczko@intel.com
2025-10-06 19:44:43 +02:00
Michal Wajdeczko
430d328877 drm/xe: Update MEMIRQ to use tile-based printk macros
We already have tile-based printk macros, so there is no need to
manually prepare MEMIRQ-specific messages to include the tile id.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251005133641.2651-5-michal.wajdeczko@intel.com
2025-10-06 19:39:26 +02:00
Michal Wajdeczko
cd11babcd0 drm/xe/pf: Update LMTT to use tile-based messages
Now that we have tile-based SR-IOV printk macros, there is no
need to manually prepare the LMTT-specific warning message (which
is now upgraded to a proper error-level message) nor to use a generic
debug message without tile/LMTT identification.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251005133641.2651-4-michal.wajdeczko@intel.com
2025-10-06 19:39:25 +02:00
Michal Wajdeczko
c66e4b6cae drm/xe: Add tile-based SRIOV printk macros
We already have device and GT level SR-IOV specific macros, but,
unlike the native case, we don't yet have tile-based ones.

Add macros to match the native use case and also update the GT-based
macros to rely on those new tile-based SR-IOV macros. This will
slightly rearrange the output of the GT logs, so instead of:

  [...] Tile0: GT0: PF: pushed VF1 config with 2 KLVs...

we might see:

  [...] PF: Tile0: GT0: pushed VF1 config with 2 KLVs...

but that's even better.
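
The prefix reordering follows from how the macros nest. The sketch below uses hypothetical `snprintf()`-based macros that only mirror the layering described above. The device-level SR-IOV macro adds `PF: ` outermost, the tile-level macro adds `Tile%u: `, and the GT-level macro is built on the tile one. It relies on the GNU `##__VA_ARGS__` extension, as kernel code does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Device-level SR-IOV prefix, added outermost. */
#define sriov_fmt(buf, len, fmt, ...) \
	snprintf(buf, len, "PF: " fmt, ##__VA_ARGS__)

/* Tile-level macro: adds "Tile%u: " inside the SR-IOV prefix. */
#define tile_sriov_fmt(buf, len, tile, fmt, ...) \
	sriov_fmt(buf, len, "Tile%u: " fmt, (tile), ##__VA_ARGS__)

/* GT-level macro built on the tile one, so the GT tag comes last. */
#define gt_sriov_fmt(buf, len, tile, gt, fmt, ...) \
	tile_sriov_fmt(buf, len, tile, "GT%u: " fmt, (gt), ##__VA_ARGS__)
```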

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251005133641.2651-3-michal.wajdeczko@intel.com
2025-10-06 19:39:23 +02:00
Michal Wajdeczko
c95f180207 drm/xe: Update SRIOV printk macros
We recently introduced xe-based printk macros; use them instead
of the plain drm-based ones.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251005133641.2651-2-michal.wajdeczko@intel.com
2025-10-06 19:39:22 +02:00
Michal Wajdeczko
9a54b5127f drm/xe/pf: Make the late-initialization really late
While the late PF per-GT initialization is done quite late in the
single GT initialization flow, on multi-GT platforms it may still be
done before the other GT's early initialization. That leads to issues
during unwind when there are cross-GT dependencies, such as a resource
cleanup that is shared by both GTs while the other GT may already be
sanitized or disabled.

The following errors could be observed when trying to unload the PF
driver with some LMEM/VRAM already provisioned for a few VFs:

 [ ] xe 0000:03:00.0: DEVRES REL ffff88814708f240 fini_config (16 bytes)
 [ ] xe 0000:03:00.0: [drm:lmtt_write_pte [xe]] PF: LMTT: WRITE level=2 index=1 pte=0x0
 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=0 addr=53a470000
 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=1 addr=53a4b0000
 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
 [ ] xe 0000:03:00.0: [drm] PF: LMTT0 invalidation failed (-ENODEV)
 [ ] xe 0000:03:00.0: [drm:lmtt_write_pte [xe]] PF: LMTT: WRITE level=2 index=2 pte=0x0
 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=0 addr=539b70000
 [ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=1 addr=539bf0000
 [ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
 [ ] xe 0000:03:00.0: [drm] PF: LMTT0 invalidation failed (-ENODEV)

Move all PF per-GT late initialization to the already defined late
SR-IOV initialization function to allow proper order of the cleanup
actions.

While around, format all PF function stubs as one-liners, like many
other stubs are defined in the Xe driver.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251004162008.1782-1-michal.wajdeczko@intel.com
2025-10-06 19:30:17 +02:00
Michal Wajdeczko
71f1939e0d drm/xe/xe_late_bind_fw: Fix and simplify parsing user input
Code was wrongly passing sizeof(uval) as the number base to use,
and unlike other debugfs entries that represent bool data, it
wasn't using the dedicated function to parse user input as bool.
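
For reference, the parsing bug and its fix can be sketched in userspace. Passing `sizeof(uval)` as the base means that, for a 32-bit `uval`, input is parsed in base 4. The `parse_bool()` below is a simplified, hypothetical approximation of the kernel's `kstrtobool()`. It accepts common true/false spellings and rejects anything else, and it may not match every input the kernel helper accepts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Look only at the leading character(s): y/Y/t/T/1 mean true,
 * n/N/f/F/0 mean false, "on"/"off" (any case) are also accepted.
 * Anything else is rejected with -EINVAL.
 */
int parse_bool(const char *s, bool *res)
{
	if (!s)
		return -EINVAL;
	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		}
		break;
	}
	return -EINVAL;
}
```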

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Badal Nilawar <badal.nilawar@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://lore.kernel.org/r/20251002192736.203186-1-michal.wajdeczko@intel.com
2025-10-06 19:24:15 +02:00
Michal Wajdeczko
869580c415 drm/xe: Don't force DRM_XE_DEBUG_MEMIRQ for SR-IOV debug
For pure SR-IOV debugging there is no need to select the already
separate config for debugging memory-based interrupts, as the latter
is also very noisy on its own. Change the config order and use a weak
reverse dependency instead.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251002171308.203127-1-michal.wajdeczko@intel.com
2025-10-06 19:11:30 +02:00
Shuicheng Lin
a908de69ce drm/xe: Fix copyright and function naming in xe_ttm_vram_mgr
- Correct copyright year from "2002" to "2022".
- Rename ttm_vram_mgr_fini() to xe_ttm_vram_mgr_fini() to avoid
  confusion with generic TTM helpers.

No functional changes intended.

Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
Link: https://lore.kernel.org/r/20251004000425.2489291-2-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-10-06 12:46:28 -04:00
Piotr Piórkowski
8462d16d1b drm/xe: Combine userspace context check
Both vm->xef and XE_LRC_CREATE_USER_CTX indicate in xe_lrc_init that
the context originates from userspace. However, XE_LRC_CREATE_USER_CTX
has a broader scope as it may be set even when no vm->xef is present.
The XE_BO_FLAG_PINNED_LATE_RESTORE flag can be extended to both cases,
so there is no point in handling the two cases separately.
Let's combine vm->xef and XE_LRC_CREATE_USER_CTX checks to detect
userspace context.

Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-6-piotr.piorkowski@intel.com
2025-10-06 08:33:52 +02:00
Piotr Piórkowski
b48140f446 drm/xe/pf: Force use user VRAM for LMEM provisioning
The LMEM assigned to VFs should be allocated from the general-purpose
VRAM pool, not from the kernel-reserved region.
Let's force the use of general-purpose VRAM for BOs intended for VFs.

Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-5-piotr.piorkowski@intel.com
2025-10-06 08:33:51 +02:00
Piotr Piórkowski
3f6cd669d5 drm/xe: Force user context allocations in user VRAM
In general, kernel structures should be allocated in the kernel-dedicated
VRAM region. However, userspace context data - while used by the kernel -
does not need to reside there.
Let's force the allocation of such data in the general-purpose VRAM region
accessible to userspace.

Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-4-piotr.piorkowski@intel.com
2025-10-06 08:33:49 +02:00
Piotr Piórkowski
9d290ab0b5 drm/xe: Introduce new BO flag XE_BO_FLAG_FORCE_USER_VRAM
When using a separate VRAM region for kernel allocations,
some kernel structures, such as context userspace data,
should not reside in the VRAM region dedicated to the kernel.
The VRAM kernel region is intended only for allocations necessary
for driver operation. Allocations created via ioctl are long-lived
and not easily evictable. If this region runs out of space,
there may not be a fallback, which could cause failures.
To prevent this, add a new BO flag that explicitly forces the BO to be
allocated in the general-purpose VRAM region accessible to userspace,
avoiding the kernel-only VRAM region.
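
A sketch of the placement decision under stated assumptions. The flag names and values are hypothetical stand-ins; only `XE_BO_FLAG_FORCE_USER_VRAM` is named in the commit. The rule that other kernel BOs land in the kernel-only region when one exists is an illustrative simplification.

```c
#include <assert.h>
#include <stdbool.h>

#define BO_FLAG_VRAM            (1u << 0) /* BO wants a VRAM placement */
#define BO_FLAG_KERNEL          (1u << 1) /* kernel-internal object */
#define BO_FLAG_FORCE_USER_VRAM (1u << 2) /* stand-in for XE_BO_FLAG_FORCE_USER_VRAM */

enum vram_region { VRAM_USER, VRAM_KERNEL };

/*
 * Pick the VRAM region for a BO when the tile may carve out a separate
 * kernel-only region. Kernel BOs go to the kernel region unless they are
 * flagged as holding long-lived, userspace-driven data.
 */
enum vram_region pick_vram_region(unsigned int flags, bool has_kernel_region)
{
	assert(flags & BO_FLAG_VRAM);
	if (!has_kernel_region)
		return VRAM_USER; /* single region shared by everyone */
	if ((flags & BO_FLAG_KERNEL) && !(flags & BO_FLAG_FORCE_USER_VRAM))
		return VRAM_KERNEL;
	return VRAM_USER;
}
```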

v2:
 - update commit message (Matthew)

Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-3-piotr.piorkowski@intel.com
2025-10-06 08:33:48 +02:00
Piotr Piórkowski
db7dde9904 drm/xe: Add initial support for separate kernel VRAM region on the tile
So far, kernel and userspace allocations have shared the same VRAM region.
However, in some scenarios, it may be necessary to reserve a separate
VRAM area exclusively for kernel allocations.
Let's add preliminary support for such a configuration.

v2:
- replaced for_each_bo_flag_vram with the improved
  for_each_set_bo_vram_flag helper (Matthew)
- moved the VRAM flag iteration macro definition into xe_bo.c (Matthew)
- drop unused bo_flags from bo_vram_flags_to_vram_placement (Matthew)
- use hweight32 helper in __xe_bo_fixed_placement for readability
  (Matthew)
v3: remove unnecessary VRAM fixup id
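
The iteration and the single-placement check mentioned in the changelog can be sketched as follows. The flag layout and macro name are hypothetical, and `__builtin_popcount()` (a GCC/Clang builtin) stands in for the kernel's `hweight32()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VRAM placement bits, one per VRAM instance. */
#define BO_FLAG_VRAM0     (1u << 4)
#define BO_FLAG_VRAM1     (1u << 5)
#define BO_FLAG_VRAM_MASK (BO_FLAG_VRAM0 | BO_FLAG_VRAM1)

/* Visit each VRAM placement bit set in @flags, lowest bit first. */
#define for_each_set_vram_flag(bit, flags)                      \
	for (unsigned int m_ = (flags) & BO_FLAG_VRAM_MASK;     \
	     m_ && ((bit) = m_ & -m_, 1); m_ &= m_ - 1)

/* A fixed placement must name exactly one VRAM instance. */
bool fixed_placement_ok(unsigned int flags)
{
	return __builtin_popcount(flags & BO_FLAG_VRAM_MASK) == 1;
}
```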

Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-2-piotr.piorkowski@intel.com
2025-10-06 08:33:46 +02:00
Matthew Brost
bdc2fb17ae Revert "drm/xe/vf: Fixup CTB send buffer messages after migration"
This reverts commit cef88d1265.

Due to a change in the VF migration recovery design, this code
is no longer needed.

v3:
 - Add commit message (Michal / Lucas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251002233824.203417-4-michal.wajdeczko@intel.com
2025-10-03 20:36:26 -07:00
Matthew Brost
6c640592e8 Revert "drm/xe/vf: Post migration, repopulate ring area for pending request"
This reverts commit a0dda25d24.

Due to a change in the VF migration recovery design, this code
is no longer needed.

v3:
 - Add commit message (Michal / Lucas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251002233824.203417-3-michal.wajdeczko@intel.com
2025-10-03 20:36:24 -07:00
Matthew Brost
08c98f3f2b Revert "drm/xe/vf: Rebase exec queue parallel commands during migration recovery"
This reverts commit ba180a3621.

Due to a change in the VF migration recovery design, this code
is no longer needed.

v3:
 - Add commit message (Michal / Lucas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251002233824.203417-2-michal.wajdeczko@intel.com
2025-10-03 20:36:23 -07:00
Michal Wajdeczko
2a8fcf7cc9 drm/xe/pf: Synchronize VF FLR between all GTs
The PF part of the VF FLR processing shall be done after all GuCs
confirm that they have finished their part of the VF FLR processing;
otherwise the PF may start clearing the VF's GGTT while another GuC
may still be accessing it.
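
The required ordering can be sketched sequentially, assuming the per-GT FLR work is split into a start step and a wait step. The functions and the event log are hypothetical and only record the order of operations. The real driver waits for GuC notifications rather than checking a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_GTS 2 /* e.g. a render GT and a media GT */

/* Order of operations, recorded for illustration. */
char flr_log[64];

static void log_event(const char *ev)
{
	strncat(flr_log, ev, sizeof(flr_log) - strlen(flr_log) - 1);
}

struct gt {
	int id;
	bool guc_flr_done;
};

/* Kick off the GuC part of VF FLR on one GT without waiting for it. */
void gt_vf_flr_start(struct gt *gt)
{
	log_event(gt->id ? "S1" : "S0");
	gt->guc_flr_done = true; /* simulated: completion is immediate here */
}

/* Wait for one GT's GuC to confirm FLR completion (0 on success). */
int gt_vf_flr_wait(struct gt *gt)
{
	log_event(gt->id ? "W1" : "W0");
	return gt->guc_flr_done ? 0 : -1;
}

/* PF-side work, e.g. clearing the VF's GGTT, shared by all GTs. */
static void pf_vf_cleanup(void)
{
	log_event("C");
}

/* Start FLR on every GT, wait for all of them, and only then clean up. */
int vf_flr(struct gt *gts, int n)
{
	for (int i = 0; i < n; i++)
		gt_vf_flr_start(&gts[i]);
	for (int i = 0; i < n; i++) {
		int err = gt_vf_flr_wait(&gts[i]);

		if (err)
			return err;
	}
	pf_vf_cleanup();
	return 0;
}
```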

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://lore.kernel.org/r/20250930233525.201263-7-michal.wajdeczko@intel.com
2025-10-02 23:58:35 +02:00
Michal Wajdeczko
03dc00c782 drm/xe/pf: Split VF FLR processing function
On multi-GT platforms (like PTL) we may want to run VF FLR on each
GuC (render and media) in parallel. Split our FLR function to allow
waiting for each GT's VF FLR completion separately.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://lore.kernel.org/r/20250930233525.201263-6-michal.wajdeczko@intel.com
2025-10-02 23:58:33 +02:00
Michal Wajdeczko
1f018c8496 drm/xe/pf: Unify VF state tracking log
By using a single function to dump VF state transitions, the final
logs are easier to analyze, as every debug message then has the same
call site.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://lore.kernel.org/r/20250930233525.201263-5-michal.wajdeczko@intel.com
2025-10-02 23:58:32 +02:00
Michal Wajdeczko
5b7451fdd7 drm/xe/pf: Expose VF control operations over debugfs
To allow the user to control the activity of individual VFs,
expose basic VF control operations (pause, resume, stop, reset)
over the debugfs as write-only files:

  /sys/kernel/debug/dri/BDF/sriov/
  ├── vf1
  │   ├── pause
  │   ├── reset
  │   ├── resume
  │   ├── stop
  │   :
  ├── vf2
  :   :

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://lore.kernel.org/r/20250930233525.201263-4-michal.wajdeczko@intel.com
2025-10-02 23:58:31 +02:00
Michal Wajdeczko
ac43294e8e drm/xe/pf: Log only top level VF state changes
The user likely only cares about top-level VF state changes, so any
per-GT VF state logs can be demoted to the debug level.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://lore.kernel.org/r/20250930233525.201263-3-michal.wajdeczko@intel.com
2025-10-02 23:58:30 +02:00
Michal Wajdeczko
c97cdf7686 drm/xe/pf: Add top level functions to control VFs
We already have control functions that we use to control the VF
state on a per-GT basis, but that is a low-level detail from the
point of view of the user, who rather expects VF-level functions.

For now, add simple functions that just iterate over all GTs and
call the per-GT control function. We will soon allow some of them
to be used from user-facing interfaces like debugfs.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://lore.kernel.org/r/20250930233525.201263-2-michal.wajdeczko@intel.com
2025-10-02 23:58:28 +02:00
Michal Wajdeczko
846a81abbe drm/xe: Detect GT workqueue allocation failure
The allocation of the per-GT workqueue may fail and we shouldn't
ignore that. While around, use the drm managed allocation function
to drop our custom fini action.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251001144051.202040-1-michal.wajdeczko@intel.com
2025-10-02 18:48:10 +02:00
Niranjana Vishwanathapura
b56bc81078 drm/xe/doc: Add documentation for Execution Queues
Add documentation for Xe Execution Queues and add xe_exec_queue.rst
file.

v2: Add info about how Execution queue interfaces
    with other components in the driver (Matt Brost)

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251002044319.450181-2-niranjana.vishwanathapura@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-10-02 08:43:07 -07:00
Raag Jadav
e4863f1159 drm/xe/i2c: Don't rely on d3cold.allowed flag in system PM path
In S3 and above sleep states, the device can lose power regardless of
the d3cold.allowed flag. Bring up the I2C controller explicitly in the
system PM path to ensure its normal operation after losing power.

v2: Cover S3 and above states (Rodrigo)

Fixes: 0ea07b6951 ("drm/xe/pm: Wire up suspend/resume for I2C controller")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250918103200.2952576-1-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-10-02 10:27:44 -04:00
Mallesh Koujalagi
07abc16c14 drm/xe/xe_late_bind_fw: Initialize uval variable in xe_late_bind_fw_num_fans()
Initialize the uval variable to 0 in xe_late_bind_fw_num_fans() to fix
a potential use of uninitialized variable warning and ensure predictable
behavior.

The variable is passed by reference to xe_pcode_read() which should
populate it on success, but initializing it to 0 provides a safe
default value and follows kernel coding best practices.

v2:
- uval = 0 which serves as both a safe default and the fallback
  value when the pcode read operation fails.

v3:
- Handle MMIO failure (Rodrigo)
- The function should probably return the error and take uval as a
  pointer argument, like pcode_read.
- Change the caller of this function to propagate the error
  upwards if mmio failed.
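
The resulting shape, sketched in userspace with a hypothetical `mock_pcode_read()` in place of the real pcode mailbox read. The value is initialized, the read error is returned to the caller, and the caller propagates it further instead of continuing with a made-up fan count. The function names and the value 3 are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mock of a pcode mailbox read; fails when @fail is set. */
static int mock_pcode_read(int fail, uint32_t *val)
{
	if (fail)
		return -EIO;
	*val = 3; /* pretend the firmware reports 3 fans */
	return 0;
}

/*
 * Store the fan count in *num_fans and return 0, or return a negative
 * errno. uval starts at 0 so it is never read uninitialized.
 */
int read_num_fans(int fail, uint32_t *num_fans)
{
	uint32_t uval = 0;
	int err = mock_pcode_read(fail, &uval);

	if (err)
		return err;
	*num_fans = uval;
	return 0;
}

/* The caller passes the error upward instead of using a bogus count. */
int late_bind_init(int fail, uint32_t *num_fans)
{
	int err = read_num_fans(fail, num_fans);

	if (err)
		return err;
	/* ... continue initialization using *num_fans ... */
	return 0;
}
```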

Fixes: 45832bf9c1 ("drm/xe/xe_late_bind_fw: Initialize late binding firmware")
Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
Link: https://lore.kernel.org/r/20251002005648.3185636-1-mallesh.koujalagi@intel.com
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-10-02 10:11:07 -04:00
Thomas Hellström
ad298d9ec9 drm/gpusvm, drm/xe: Fix userptr to not allow device private pages
When userptr is used on SVM-enabled VMs, a non-NULL
hmm_range::dev_private_owner value might mean that
hmm_range_fault() attempts to return device private pages.
Either that will fail, or the userptr code will not know
how to handle those.

Use NULL for hmm_range::dev_private_owner to migrate
such pages to system. In order to do that, move the
struct drm_gpusvm::device_private_page_owner field to
struct drm_gpusvm_ctx::device_private_page_owner so that
it doesn't remain immutable over the drm_gpusvm lifetime.

v2:
- Don't conditionally compile xe_svm_devm_owner().
- Kerneldoc xe_svm_devm_owner().

Fixes: 9e97874148 ("drm/xe/userptr: replace xe_hmm with gpusvm")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/20250930122752.96034-1-thomas.hellstrom@linux.intel.com
2025-10-02 11:50:12 +02:00
Raag Jadav
d9c401d8f3 drm/xe/sysfs: Drop redundant runtime PM usage
The device is expected to be in D0 state during driver probe, so there
is no need to resume it in ->is_visible() callbacks or non-I/O operations.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250918114804.2957177-3-raag.jadav@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-30 07:52:45 -07:00
Raag Jadav
5a856e277b drm/xe/hwmon: Drop redundant runtime PM usage
The device is expected to be in D0 state during driver probe, so there
is no need to resume it in ->is_visible() callbacks.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250918114804.2957177-2-raag.jadav@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-09-30 07:52:39 -07:00
Colin Ian King
20f3b28e2e drm/xe/xe_late_bind_fw: Fix missing initialization of variable offset
The variable offset is not being initialized, and it is only set inside
a for-loop if entry->name is the same as manifest_entry. In the case
where it is not set, the non-zero check on offset is potentially checking
a bogus uninitialized value. Fix this by initializing offset to zero.
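
The fixed pattern as a self-contained sketch with hypothetical types. `offset` starts at 0, so the not-found path yields a defined value that the later non-zero check can rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct manifest_entry {
	const char *name;
	size_t offset;
};

/* Return the offset of the entry named @wanted, or 0 if it is absent. */
size_t find_entry_offset(const struct manifest_entry *entries, size_t n,
			 const char *wanted)
{
	size_t offset = 0; /* defined result even when nothing matches */

	for (size_t i = 0; i < n; i++) {
		if (strcmp(entries[i].name, wanted) == 0) {
			offset = entries[i].offset;
			break;
		}
	}
	return offset;
}
```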

Fixes: efa29317a5 ("drm/xe/xe_late_bind_fw: Extract and print version info")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://lore.kernel.org/r/20250924102208.9216-1-colin.i.king@gmail.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-30 10:31:44 -04:00
Thomas Hellström
8f1756a7ea drm/xe/bo: Fix an idle assertion for local bos
Before calling ttm_bo_populate() in the CPU fault path of a bo,
we assert that the bo is not being migrated. However, for
local bos we share the reservation object with other local bos
that might be in the process of being migrated. Also some VM
operations may attach USAGE_KERNEL fences to the common
reservation object and trigger false positives from the assert.

So remove the assert and instead wait for bo idle. This may
unnecessarily wait for idle in some cases, but since we're
doing this wait later in the fault path anyway, we might as
well do it here.

This fixes warnings like:
Sep 25 14:56:23 desky kernel: ------------[ cut here ]------------
Sep 25 14:56:23 desky kernel: xe 0000:03:00.0: [drm] Assertion `dma_resv_test_signaled(tbo->base.resv, DMA_RESV_USAGE_KERNEL) || (tbo->ttm && ttm_tt_is_populated(tbo->ttm))` failed!
                              platform: BATTLEMAGE subplatform: 1
                              graphics: Xe2_HPG 20.01 step A0
                              media: Xe2_HPM 13.01 step A1
Sep 25 14:56:23 desky kernel: WARNING: CPU: 6 PID: 24767 at drivers/gpu/drm/xe/xe_bo.c:1748 xe_bo_fault_migrate+0x1bb/0x300 [xe]
Sep 25 14:56:23 desky kernel: Modules linked in: cpuid dm_crypt xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfr>
Sep 25 14:56:23 desky kernel:  snd_soc_sdca snd_seq_midi prime_numbers coretemp snd_seq_midi_event drm_ttm_helper snd_hda_codec drm_buddy drm_exec snd_rawmidi snd_soc_core snd_hda_cor>
Sep 25 14:56:23 desky kernel: CPU: 6 UID: 1000 PID: 24767 Comm: steamwebhelper Tainted: G     U  W           6.17.0-rc7+ #32 PREEMPT(voluntary)
Sep 25 14:56:23 desky kernel: Tainted: [U]=USER, [W]=WARN
Sep 25 14:56:23 desky kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D36/PRO Z690-P DDR4 (MS-7D36), BIOS A.A1 10/18/2022
Sep 25 14:56:23 desky kernel: RIP: 0010:xe_bo_fault_migrate+0x1bb/0x300 [xe]
Sep 25 14:56:23 desky kernel: Code: fa 64 29 f9 48 c7 c7 40 e0 d3 c1 51 48 c7 c1 c0 e3 d3 c1 52 4c 8b 45 c0 41 50 44 8b 4d c8 4d 89 e0 48 8b 55 a8 e8 25 27 95 ef <0f> 0b 48 83 c4 40 4>
Sep 25 14:56:23 desky kernel: RSP: 0000:ffffae1ca88c7b10 EFLAGS: 00010286
Sep 25 14:56:23 desky kernel: RAX: 0000000000000000 RBX: ffff8d7cfd7e6800 RCX: 0000000000000027
Sep 25 14:56:23 desky kernel: RDX: ffff8d845019cec8 RSI: 0000000000000001 RDI: ffff8d845019cec0
Sep 25 14:56:23 desky kernel: RBP: ffffae1ca88c7bc8 R08: 0000000000000000 R09: 0000000000000000
Sep 25 14:56:23 desky kernel: R10: 0000000000000000 R11: 0000000000000004 R12: ffffffffc1db1faa
Sep 25 14:56:23 desky kernel: R13: ffffffffc1db2ab4 R14: 0000000000000001 R15: ffffae1ca88c7bd8
Sep 25 14:56:23 desky kernel: FS:  00007fb1baf31940(0000) GS:ffff8d849c870000(0000) knlGS:0000000000000000
Sep 25 14:56:23 desky kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 25 14:56:23 desky kernel: CR2: 00007fb1b2860020 CR3: 00000001705a9004 CR4: 0000000000772ef0
Sep 25 14:56:23 desky kernel: PKRU: 55555558
Sep 25 14:56:23 desky kernel: Call Trace:
Sep 25 14:56:23 desky kernel:  <TASK>
Sep 25 14:56:23 desky kernel:  xe_bo_cpu_fault_fastpath+0x11e/0x220 [xe]
Sep 25 14:56:23 desky kernel:  xe_bo_cpu_fault+0x84/0x410 [xe]
Sep 25 14:56:23 desky kernel:  ? __x64_sys_mmap+0x33/0x50
Sep 25 14:56:23 desky kernel:  ? x64_sys_call+0x1b2e/0x20d0
Sep 25 14:56:23 desky kernel:  ? do_syscall_64+0x9d/0x1f0
Sep 25 14:56:23 desky kernel:  ? __check_object_size+0x4a/0x2e0
Sep 25 14:56:23 desky kernel:  __do_fault+0x36/0x190
Sep 25 14:56:23 desky kernel:  do_fault+0xcf/0x570
Sep 25 14:56:23 desky kernel:  __handle_mm_fault+0x92b/0xfe0
Sep 25 14:56:23 desky kernel:  ? ktime_get_mono_fast_ns+0x39/0xd0
Sep 25 14:56:23 desky kernel:  handle_mm_fault+0x164/0x2c0
Sep 25 14:56:23 desky kernel:  do_user_addr_fault+0x2cb/0x840
Sep 25 14:56:23 desky kernel:  exc_page_fault+0x75/0x180
Sep 25 14:56:23 desky kernel:  asm_exc_page_fault+0x27/0x30
Sep 25 14:56:23 desky kernel: RIP: 0033:0x7fb1bc388bb7
Sep 25 14:56:23 desky kernel: Code: 48 ff c7 48 01 fe 48 8d 54 11 80 0f 1f 84 00 00 00 00 00 c5 fe 6f 0e c5 fe 6f 56 20 c5 fe 6f 5e 40 c5 fe 6f 66 60 48 83 ee 80 <c5> fd 7f 0f c5 fd 7>
Sep 25 14:56:23 desky kernel: RSP: 002b:00007ffd7814fad8 EFLAGS: 00010207
Sep 25 14:56:23 desky kernel: RAX: 00007fb1b2860000 RBX: 0000000000000690 RCX: 00007fb1b2860000
Sep 25 14:56:23 desky kernel: RDX: 00007fb1b2860610 RSI: 0000556eda79f4c0 RDI: 00007fb1b2860020
Sep 25 14:56:23 desky kernel: RBP: 00007ffd7814fb60 R08: 0000000000000000 R09: 000000012be0e000
Sep 25 14:56:23 desky kernel: R10: 00007fb1b2860000 R11: 0000000000000246 R12: 0000556edd39a240
Sep 25 14:56:23 desky kernel: R13: 00007fb1b2dcb010 R14: 0000556eda79f420 R15: 0000000000000000
Sep 25 14:56:23 desky kernel:  </TASK>

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5250
Fixes: c2ae94cf8c ("drm/xe: Convert the CPU fault handler for exhaustive eviction")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250929112649.6131-1-thomas.hellstrom@linux.intel.com
2025-09-30 10:33:51 +02:00
Michal Wajdeczko
65774efef2 drm/xe/debugfs: Update xe_pat_dump signature
Our debugfs helper xe_gt_debugfs_show_with_rpm() expects print()
functions to return int. The new signature allows us to drop the wrapper.

While at it, move the kernel-doc closer to the function definition,
as suggested in the doc-guide.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250923211613.193347-6-michal.wajdeczko@intel.com
2025-09-30 10:21:28 +02:00
Michal Wajdeczko
ab6ccd4f7e drm/xe/debugfs: Update xe_mocs_dump signature
Our debugfs helper xe_gt_debugfs_show_with_rpm() expects print()
functions to return int. The new signature allows us to drop the wrapper.

While at it, move the kernel-doc closer to the function definition,
as suggested in the doc-guide.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250923211613.193347-5-michal.wajdeczko@intel.com
2025-09-30 10:21:27 +02:00
Michal Wajdeczko
8980530abf drm/xe/debugfs: Update xe_tuning_dump signature
Our debugfs helper xe_gt_debugfs_show_with_rpm() expects print()
functions to return int. The new signature allows us to drop the wrapper.

While at it, print the additional separator lines using puts() to
avoid output with a leading \n, which might confuse some printers.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250923211613.193347-4-michal.wajdeczko@intel.com
2025-09-30 10:21:26 +02:00
Michal Wajdeczko
d06e0c33f3 drm/xe/debugfs: Update xe_wa_dump signature
Our debugfs helper xe_gt_debugfs_show_with_rpm() expects print()
functions to return int. The new signature allows us to drop the wrapper.

While at it, print the additional separator lines using puts() to
avoid output with a leading \n, which might confuse some printers.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250923211613.193347-3-michal.wajdeczko@intel.com
2025-09-30 10:21:24 +02:00
Michal Wajdeczko
103094205d drm/xe/debugfs: Update xe_gt_topology_dump signature
Our debugfs helper xe_gt_debugfs_show_with_rpm() expects print()
functions to return int. The new signature allows us to drop the wrapper.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250923211613.193347-2-michal.wajdeczko@intel.com
2025-09-30 10:21:23 +02:00
Michal Wajdeczko
486d7f1bd1 drm/xe/pf: Make GGTT/LMEM debugfs files per-tile
Due to the initial design of the Xe debugfs, the GGTT and LMEM files
were defined on the primary GT instead of being per-tile.

While the PF provisioning code still maintains GGTT and LMEM at the
primary-GT level, this will be refactored soon; but we can fix the
debugfs layout now, as part of the new SR-IOV tree.

For backward compatibility, we will provide some symlinks that can
be removed once our tools are fully converted.

As we are making all these changes to the user-facing interface,
take this as an opportunity to also start replacing the "LMEM" term,
used by the SR-IOV code, with the "VRAM" term, used by the Xe driver.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250928140029.198847-7-michal.wajdeczko@intel.com
2025-09-30 00:03:52 +02:00
Michal Wajdeczko
8cd71c40e9 drm/xe/debugfs: Promote xe_tile_debugfs_simple_show
We will want to use this helper function in other files.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250928140029.198847-6-michal.wajdeczko@intel.com
2025-09-29 23:58:48 +02:00
Michal Wajdeczko
9a719bbf8d drm/xe/pf: Move SR-IOV GT debugfs files to new tree
Instead of expanding the GT debugfs directories with a large number of
SR-IOV files, which are replicated for each SR-IOV function,
move them to our new debugfs tree, organized by function.

But to avoid breaking IGT tests that use the current layout, provide
symlinks, which could be removed once the transition period is over,
or we can leave them for convenience.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250928140029.198847-5-michal.wajdeczko@intel.com
2025-09-29 23:58:47 +02:00
Michal Wajdeczko
5489e7d44a drm/xe/pf: Populate SR-IOV debugfs tree with tiles
Populate the new per-SR-IOV-function debugfs directories with
next-level directories that represent tiles. There are no files yet,
but we will continue updating that tree in upcoming patches.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250928140029.198847-4-michal.wajdeczko@intel.com
2025-09-29 23:58:44 +02:00