TLB invalidation requests can be lost during VF post-migration
recovery. Since the VF has migrated, these invalidations are no longer
needed.
Reset the TLB invalidation frontend, which will signal all pending
fences.
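For illustration only, a minimal userspace sketch of what "reset signals
all pending fences" means. The struct and function names are hypothetical
and are not the driver's API; the assumption is simply that a reset walks
the outstanding fences and completes each one that is still pending.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending TLB invalidation fence. */
struct inval_fence {
	bool signaled;
};

/*
 * Signal every fence that is still pending and return how many were
 * signaled. Fences that already completed are left untouched.
 */
static size_t inval_reset(struct inval_fence *fences, size_t count)
{
	size_t signaled = 0;

	for (size_t i = 0; i < count; i++) {
		if (fences[i].signaled)
			continue;
		fences[i].signaled = true;
		signaled++;
	}
	return signaled;
}
```

After such a reset no waiter remains blocked on an invalidation whose
request was lost.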
v3:
- Move TLB invalidation reset after pausing submission (Tomasz)
- Adjust commit message (Michal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-22-matthew.brost@intel.com
Flushing CTs (i.e., processing all pending G2H messages) gives VF
post-migration recovery an accurate view of which H2G messages the GuC
has processed, enabling the GuC submission state machine to correctly
rebuild all state.
Also, stop all CT traffic, as the CT is not live during VF
post-migration recovery.
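A rough model of the flush-then-stop ordering, as a userspace sketch. All
names here are hypothetical, and the choice of -ECANCELED for a send on a
stopped channel is an assumption made for the example, not taken from the
driver.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a CT channel. */
struct ct_model {
	size_t g2h_pending;	/* G2H messages received but not yet handled */
	size_t g2h_processed;	/* G2H messages handled so far */
	bool stopped;
};

static void ct_flush_and_stop(struct ct_model *ct)
{
	/* First handle everything the GuC has already sent ... */
	ct->g2h_processed += ct->g2h_pending;
	ct->g2h_pending = 0;
	/* ... then refuse further traffic until the channel is restarted. */
	ct->stopped = true;
}

/* Sending is only possible while the channel is live. */
static int ct_send(const struct ct_model *ct)
{
	return ct->stopped ? -ECANCELED : 0;
}
```

Flushing before stopping is what gives recovery an accurate picture: no
G2H that already arrived is left unaccounted for.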
v3:
- xe_guc_ct_flush_and_stop rename (Michal)
- Drop extra GuC CT WQ wake up (Michal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-21-matthew.brost@intel.com
The only case where the GuC submission backend cannot reason 100%
correctly is when a GuC context is registered during VF post-migration
recovery. In this scenario, it's possible that the GuC context register
H2G is processed, but the immediately following schedule-enable H2G gets
lost. The schedule-enable G2H "done" response is how the GuC state machine
determines whether context registration has completed.
A double register is harmless when using `GUC_HXG_TYPE_EVENT`, as GuC
simply drops the duplicate H2G. To keep things simple, use
`GUC_HXG_TYPE_EVENT` for all context registrations on VFs.
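Why a double register is harmless can be shown with a small model. This is
a sketch under the assumption stated above (the receiver drops a duplicate
event-type request); the struct and function names are hypothetical and do
not describe the GuC firmware interface.

```c
#include <stdbool.h>

#define MAX_CTX 16

/* Hypothetical model of the receiver's view of registered contexts. */
struct guc_model {
	bool registered[MAX_CTX];
	unsigned int dropped;	/* duplicate requests silently ignored */
};

/* Event-style request: no reply is expected, duplicates are dropped. */
static void register_ctx_event(struct guc_model *guc, unsigned int id)
{
	if (id >= MAX_CTX)
		return;
	if (guc->registered[id]) {
		guc->dropped++;
		return;
	}
	guc->registered[id] = true;
}
```

Sending the registration a second time during recovery leaves the model in
the same state as sending it once.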
v5:
- Check for xe_sriov_vf_migration_supported (Tomasz)
v7:
- Add comment about subsequent protocol failures (Tomasz)
- Modify commit message (Michal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-20-matthew.brost@intel.com
Blocking in work queues on a hardware action that may never occur,
especially when it depends on a software fixup also scheduled on a
work queue, is a recipe for deadlock. This situation arises with
the preempt rebind worker and VF post-migration recovery. To prevent
potential deadlocks, avoid indefinite blocking in the preempt rebind
worker for VFs that support migration.
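The idea can be sketched in userspace as follows. This is not the driver
code: the fence is modeled as a countdown of polls, the timeout of 4 polls
is arbitrary, and all names are hypothetical.

```c
#include <stdbool.h>

enum rebind_result { REBIND_DONE, REBIND_REQUEUE };

/* Hypothetical fence: signals after a number of further polls. */
struct fence_model {
	int polls_until_signal;
};

/* Returns true if the fence signaled within max_polls, false on timeout. */
static bool fence_wait_timeout(struct fence_model *f, int max_polls)
{
	while (f->polls_until_signal > 0 && max_polls-- > 0)
		f->polls_until_signal--;
	return f->polls_until_signal == 0;
}

static enum rebind_result rebind_worker(struct fence_model *f,
					bool vf_migration)
{
	if (!vf_migration) {
		/* No migration: waiting until the fence signals is fine. */
		while (!fence_wait_timeout(f, 1))
			;
		return REBIND_DONE;
	}
	/* Migratable VF: bounded wait, then give the work queue back. */
	return fence_wait_timeout(f, 4) ? REBIND_DONE : REBIND_REQUEUE;
}
```

Returning REBIND_REQUEUE lets other work, such as the recovery fixup the
fence may depend on, run before the worker tries again.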
v4:
- Use dma_fence_wait_timeout (CI)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-19-matthew.brost@intel.com
If VF post-migration recovery is in progress, the recovery flow will
rebuild all GuC submission state. In this case, exit all waiters to
ensure that submission queue scheduling can also be paused. Avoid taking
any adverse actions after aborting the wait.
As part of waking up the GuC backend, suspend_wait can now return
-EAGAIN, indicating that the wait should be retried. If the caller is
running in a work item, that work item needs to be requeued to avoid a
deadlock in which it blocks the VF migration recovery work item.
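A minimal sketch of the caller-side contract, with hypothetical names. The
-ETIMEDOUT result for a queue that is neither suspended nor in recovery is
an assumption made only to keep the model complete.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical view of a queue as seen by a suspend waiter. */
struct queue_model {
	bool suspended;		/* suspend has been acknowledged */
	bool recovery_pending;	/* VF post-migration recovery in progress */
};

/* Returns 0 once suspended, or -EAGAIN if recovery aborted the wait. */
static int suspend_wait_model(const struct queue_model *q)
{
	if (q->recovery_pending)
		return -EAGAIN;
	return q->suspended ? 0 : -ETIMEDOUT;
}

/* Work item body: returns true when the item must be queued again. */
static bool work_item_needs_requeue(const struct queue_model *q)
{
	return suspend_wait_model(q) == -EAGAIN;
}
```

The work item returns instead of looping, so the recovery work item can be
scheduled, and the wait is retried on the next run.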
v3:
- Don't block in preempt fence work queue as this can interfere with VF
post-migration work queue scheduling leading to deadlock (Testing)
- Use xe_gt_recovery_inprogress (Michal)
v5:
- Use static function for vf_recovery (Michal)
- Add helper to wake CT waiters (Michal)
- Move some code to following patch (Michal)
- Adjust commit message to explain suspend_wait returning -EAGAIN (Michal)
- Add kernel doc to suspend_wait around returning -EAGAIN
v7:
- Add comment on why a shared wait queue is needed on VFs (Michal)
- Guard against suspend_wait signaling early on resfix done (Tomasz)
v8:
- Fix kernel doc (CI)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-18-matthew.brost@intel.com
With well-behaved software, a GT reset should never occur, nor should it
happen during VF post-migration recovery. If it does, trigger a warning
but suppress the GT reset, as VF post-migration recovery is expected to
bring the VF back to a working state.
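The intended behavior, as a userspace sketch with hypothetical names (the
warning is a plain stderr message here, not the kernel's WARN machinery):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical GT state relevant to the reset decision. */
struct gt_model {
	bool recovery_pending;	/* VF post-migration recovery in progress */
	unsigned int resets;	/* resets actually performed */
};

/* Returns true if a reset was performed, false if it was suppressed. */
static bool gt_reset_model(struct gt_model *gt)
{
	if (gt->recovery_pending) {
		fprintf(stderr, "WARN: GT reset requested during VF recovery\n");
		return false;
	}
	gt->resets++;
	return true;
}
```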
v3:
- Better commit message (Tomasz)
v5:
- Use xe_gt_WARN_ON (Michal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-17-matthew.brost@intel.com
As multi-GT VF post-migration recovery can run in parallel on different
workqueues, but both GTs point to the same GGTT, only one GT needs to
shift the GGTT. However, both GTs need to know when this step has
completed. To coordinate this, perform the GGTT shift under the GGTT
lock. With shift being done under the lock, storing the shift value
becomes unnecessary.
In addition to the above, move the GGTT VF config from the GT to the tile.
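One way to picture the coordination is the sketch below. It is a userspace
model, not the driver code: the names are hypothetical, and detecting an
already applied shift by comparing the new base against the tile's current
base is an assumption of this example.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tile-level state shared by both GTs. */
struct tile_model {
	pthread_mutex_t ggtt_lock;
	uint64_t ggtt_base;	/* base the GGTT nodes currently assume */
	unsigned int shifts;	/* number of times the nodes were moved */
};

/*
 * Called by each GT's recovery with the base reported after migration.
 * Returns true if this caller performed the shift.
 */
static bool ggtt_fixup(struct tile_model *tile, uint64_t new_base)
{
	bool shifted = false;

	pthread_mutex_lock(&tile->ggtt_lock);
	if (new_base != tile->ggtt_base) {
		/* The nodes would be moved by new_base - ggtt_base here. */
		tile->ggtt_base = new_base;
		tile->shifts++;
		shifted = true;
	}
	pthread_mutex_unlock(&tile->ggtt_lock);
	return shifted;
}
```

Whichever GT takes the lock first does the work; the other one observes
the updated base once it gets the lock, so no separate shift value has to
be kept around.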
v3:
- Update commit message (Tomasz)
v4:
- Move GGTT values to tile state (Michal)
- Use GGTT lock (Michal)
v5:
- Only take GGTT lock during recovery (CI)
- Drop goto in vf_get_submission_cfg (Michal)
- Add kernel doc around recovery in xe_gt_sriov_vf_query_config (Michal)
v7:
- Drop recovery variable (Michal)
- Use _locked naming (Michal)
- Use guard (Michal)
v9:
- Break LMEM changes into different patch (Michal)
- Fix layering (Michal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-15-matthew.brost@intel.com
VF recovery is a per-GT operation, so it makes sense to isolate it to a
per-GT queue. Scheduling this operation on the same worker as the GT
reset and TDR not only aligns with this design but also helps avoid race
conditions, as those operations can also modify the queue state.
v2:
- Fix lockdep splat (Adam)
- Use xe_sriov_vf_migration_supported helper
v3:
- Drop xe_gt_sriov_ prefix for private functions (Michal)
- Drop message in xe_gt_sriov_vf_migration_init_early (Michal)
- Logic rework in vf_post_migration_notify_resfix_done (Michal)
- Rework init sequence layering (Michal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-10-matthew.brost@intel.com
Add xe_gt_recovery_pending helper.
This helper serves as the singular point to determine whether a GT
recovery is currently in progress. Expected callers include the GuC CT
layer and the GuC submission layer. The pending state is atomically
visible from the moment the vCPUs are unhalted until VF recovery completes.
v3:
- Add GT layer xe_gt_recovery_inprogress (Michal)
- Don't blow up in memirq not enabled (CI)
- Add __memirq_received with clear argument (Michal)
- xe_memirq_sw_int_0_irq_pending rename (Michal)
- Use offset in xe_memirq_sw_int_0_irq_pending (Michal)
v4:
- Refactor xe_gt_recovery_inprogress logic around memirq (Michal)
v5:
- s/inprogress/pending (Michal)
v7:
- Fix typos, adjust comment (Michal)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-9-matthew.brost@intel.com
The LRC W/A currently checks for LRC being iomem in some places, while
in others it checks if the scratch buffer is non-NULL. This
inconsistency causes issues with the VF post-migration recovery code,
which blindly passes in a scratch buffer.
This patch standardizes the check by consistently verifying whether the
LRC is iomem to determine if the scratch buffer should be used.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-8-matthew.brost@intel.com
Now that we save the job's head during submission, it's no longer
necessary to adjust the LRC ring head during resubmission. Instead, a
software-based adjustment of the tail will overwrite the old jobs in
place. For some odd reason, adjusting the LRC ring head didn't work on
parallel queues, which was causing issues in our CI.
v5:
- Add comment in guc_exec_queue_start explaining why the function works
(Auld)
v7:
- Only adjust first state on first unsignaled job (Auld)
v8:
- Break unsignaled job handling to separate patch (Auld)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-7-matthew.brost@intel.com
In all cases where the first pending job helper is called, we only want
to retrieve the first unsignaled pending job, as this helper is used
exclusively in recovery flows. It is possible for signaled jobs to
remain in the pending list as the scheduler is stopped, so those should
be skipped.
Also, add kernel documentation to clarify this behavior.
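The lookup described above amounts to the following, shown as a userspace
sketch. The job structure is a hypothetical stand-in and not the DRM
scheduler's type.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a job on the pending list. */
struct job_model {
	bool signaled;		/* the job's fence has signaled */
	struct job_model *next;	/* next job in submission order */
};

/* Return the first job whose fence has not signaled, or NULL if none. */
static struct job_model *first_unsignaled_job(struct job_model *pending)
{
	for (struct job_model *job = pending; job; job = job->next)
		if (!job->signaled)
			return job;
	return NULL;
}
```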
v8:
- Split out into own patch (Auld)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-6-matthew.brost@intel.com
VF migration requires jobs to remain pending so they can be replayed
after the VF comes back. Previously, LR job fences were intentionally
signaled immediately after submission to avoid the risk of exporting
them, as these fences do not naturally signal in a timely manner and
could break dma-fence contracts. A side effect of this approach was that
LR jobs were never added to the DRM scheduler’s pending list, preventing
them from being tracked for later resubmission.
We now avoid signaling LR job fences and ensure they are never exported;
Xe already guards against exporting these internal fences. With that
guarantee in place, we can safely track LR jobs in the scheduler’s
pending list so they are eligible for resubmission during VF
post-migration recovery (and similar recovery paths).
An added benefit is that LR queues now gain the DRM scheduler’s built-in
flow control over ring usage rather than rejecting new jobs in the exec
IOCTL if the ring is full.
v2:
- Ensure DRM scheduler TDR doesn't run for LR jobs
- Stack variable for killed_or_banned_or_wedged
v4:
- Clarify commit message (Tomasz)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-5-matthew.brost@intel.com
Add explicit tracking in the GuC submission state to record the source
of a pending enable (TDR vs. queue resume path vs. submission).
Disambiguating the origin lets the GuC submission state machine apply
the correct recovery/replay behavior.
This helps VF restore: when the device comes back, the state machine knows
whether the pending enable stems from timeout recovery, from a queue resume
sequence, or submission and can gate sequencing and fixups accordingly.
v4:
- Clarify commit message (Tomasz)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://lore.kernel.org/r/20251008214532.3442967-4-matthew.brost@intel.com
We already have device and GT level SR-IOV specific macros, but
unlike the native case, we don't yet have tile-based ones.
Add macros to match the native use case and also update GT-based
macros to rely on those new tile-based SR-IOV macros. This will
slightly rearrange the output of the GT logs, and instead of:
[...] Tile0: GT0: PF: pushed VF1 config with 2 KLVs...
we might see:
[...] PF: Tile0: GT0: pushed VF1 config with 2 KLVs...
but that's even better.
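The reordering follows from how the macros nest. A userspace sketch with
hypothetical macro names (formatting into a buffer with snprintf instead
of calling the driver's logging helpers):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical tile-level helper: the SR-IOV prefix is emitted first. */
#define tile_sriov_fmt(buf, len, tile, fmt, ...) \
	snprintf(buf, len, "PF: Tile%u: " fmt, tile, __VA_ARGS__)

/* GT-level helper built on the tile-level one, so "PF:" still leads. */
#define gt_sriov_fmt(buf, len, tile, gt, fmt, ...) \
	tile_sriov_fmt(buf, len, tile, "GT%u: " fmt, gt, __VA_ARGS__)

/* True if the formatted line leads with the PF marker. */
static int starts_with_pf(const char *line)
{
	return strncmp(line, "PF: ", 4) == 0;
}
```

Because the GT macro only prepends its own "GT%u: " to the format string
and hands the rest to the tile macro, the mode prefix ends up outermost.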
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251005133641.2651-3-michal.wajdeczko@intel.com
While the late PF per-GT initialization is done quite late in the
single GT initialization flow, in case of multi-GT platforms, it
may still be done before other GT early initialization. That leads
to some issues during unwind, when there are cross-GT dependencies,
like resource cleanup that is shared by both GTs, but the other GT
may already be sanitized or disabled.
The following errors could be observed when trying to unload the PF
driver with some LMEM/VRAM already provisioned for a few VFs:
[ ] xe 0000:03:00.0: DEVRES REL ffff88814708f240 fini_config (16 bytes)
[ ] xe 0000:03:00.0: [drm:lmtt_write_pte [xe]] PF: LMTT: WRITE level=2 index=1 pte=0x0
[ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
[ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=0 addr=53a470000
[ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=1 addr=53a4b0000
[ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
[ ] xe 0000:03:00.0: [drm] PF: LMTT0 invalidation failed (-ENODEV)
[ ] xe 0000:03:00.0: [drm:lmtt_write_pte [xe]] PF: LMTT: WRITE level=2 index=2 pte=0x0
[ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
[ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=0 addr=539b70000
[ ] xe 0000:03:00.0: [drm:lmtt_pt_free [xe]] PF: LMTT: level=1 addr=539bf0000
[ ] xe 0000:03:00.0: [drm:lmtt_invalidate_hw [xe]] PF: LMTT: num_fences=2 err=-19
[ ] xe 0000:03:00.0: [drm] PF: LMTT0 invalidation failed (-ENODEV)
Move all PF per-GT late initialization to the already defined late
SR-IOV initialization function to allow proper order of the cleanup
actions.
While at it, format all PF function stubs as one-liners, as many
other stubs are defined in the Xe driver.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20251004162008.1782-1-michal.wajdeczko@intel.com
Both vm->xef and XE_LRC_CREATE_USER_CTX indicate in xe_lrc_init that
the context originates from userspace. However, XE_LRC_CREATE_USER_CTX
has a broader scope as it may be set even when no vm->xef is present.
The XE_BO_FLAG_PINNED_LATE_RESTORE flag can be extended to both cases,
so there is no point in handling the two cases separately.
Let's combine vm->xef and XE_LRC_CREATE_USER_CTX checks to detect
userspace context.
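The combined check can be pictured as below. This is a sketch: the flag
values and the struct are stand-ins chosen for the example, and only the
names XE_LRC_CREATE_USER_CTX and XE_BO_FLAG_PINNED_LATE_RESTORE come from
the text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for XE_LRC_CREATE_USER_CTX and XE_BO_FLAG_PINNED_LATE_RESTORE. */
#define LRC_CREATE_USER_CTX		(1u << 0)
#define BO_FLAG_PINNED_LATE_RESTORE	(1u << 1)

/* Hypothetical VM: xef is non-NULL when it belongs to a userspace client. */
struct vm_model {
	void *xef;
};

/* One test for "userspace context" instead of two separate code paths. */
static uint32_t lrc_bo_flags(const struct vm_model *vm, uint32_t init_flags)
{
	bool user_ctx = (vm && vm->xef) || (init_flags & LRC_CREATE_USER_CTX);

	return user_ctx ? BO_FLAG_PINNED_LATE_RESTORE : 0;
}
```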
Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-6-piotr.piorkowski@intel.com
When using a separate VRAM region for kernel allocations,
some kernel structures, such as context userspace data,
should not reside in the VRAM region dedicated to the kernel.
The VRAM kernel region is intended only for allocations necessary
for driver operation. Allocations created via ioctl are long-lived
and not easily evictable. If this region runs out of space,
there may not be a fallback, which could cause failures.
To prevent this, add a new BO flag that explicitly forces the BO to be
allocated in the general-purpose VRAM region accessible to userspace,
avoiding the kernel-only VRAM region.
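A sketch of the placement decision this implies. The flag name, the enum
and the function are hypothetical; the new BO flag is not named in this
message, so the example makes up its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the new "avoid the kernel region" BO flag. */
#define BO_FLAG_FORCE_USER_VRAM	(1u << 0)

enum vram_region { VRAM_KERNEL, VRAM_USER };

static enum vram_region pick_vram_region(bool has_kernel_region,
					 bool kernel_bo, uint32_t flags)
{
	if (!has_kernel_region)
		return VRAM_USER;	/* single region shared by everyone */
	if (kernel_bo && !(flags & BO_FLAG_FORCE_USER_VRAM))
		return VRAM_KERNEL;
	return VRAM_USER;
}
```

A kernel-created BO that carries per-context userspace data would set the
flag and therefore never consume space in the kernel-only region.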
v2:
- update commit message (Matthew)
Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-3-piotr.piorkowski@intel.com
So far, kernel and userspace allocations have shared the same VRAM region.
However, in some scenarios, it may be necessary to reserve a separate
VRAM area exclusively for kernel allocations.
Let's add preliminary support for such a configuration.
v2:
- replaced for_each_bo_flag_vram with the improved
for_each_set_bo_vram_flag helper (Matthew)
- moved the VRAM flag iteration macro definition into xe_bo.c (Matthew)
- drop unused bo_flags from bo_vram_flags_to_vram_placement (Matthew)
- use hweight32 helper in __xe_bo_fixed_placement for readability
(Matthew)
v3: remove unnecessary VRAM fixup id
Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://lore.kernel.org/r/20251003162619.1984236-2-piotr.piorkowski@intel.com
We already have control functions that we use to control the VF
state on a per-GT basis, but that is a low-level detail from the
user's point of view, who rather expects VF-level functions.
For now, add simple functions that just iterate over all GTs and
call the per-GT control function. We will soon allow some of them
to be used from user-facing interfaces like debugfs.
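The shape of such a VF-level wrapper, as a userspace sketch. The names are
hypothetical, and continuing over the remaining GTs while reporting the
first error is an assumption of this example rather than a statement about
the driver's error policy.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical signature of a per-GT control function. */
typedef int (*gt_control_fn)(unsigned int gt_id, unsigned int vfid);

/* Call the per-GT function on every GT; report the first error seen. */
static int vf_control_all_gts(size_t num_gts, unsigned int vfid,
			      gt_control_fn fn)
{
	int result = 0;

	for (size_t gt = 0; gt < num_gts; gt++) {
		int err = fn((unsigned int)gt, vfid);

		if (err && !result)
			result = err;
	}
	return result;
}

/* Example per-GT operation: pretend GT1 cannot pause the VF. */
static int pause_vf_on_gt(unsigned int gt_id, unsigned int vfid)
{
	(void)vfid;
	return gt_id == 1 ? -EIO : 0;
}
```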
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://lore.kernel.org/r/20250930233525.201263-2-michal.wajdeczko@intel.com