For parallel exec queues, xe_exec_ioctl() copied the batch buffer address
array from userspace without checking num_batch_buffer.
If user creates a sync-only exec that doesn't use the address field, the
exec will fail with -EFAULT.
Add num_batch_buffer check to skip the copy, and the exec could be executed
successfully.
Here is the sync-only exec:
struct drm_xe_exec exec = {
.extensions = 0,
.exec_queue_id = qid,
.num_syncs = 1,
.syncs = (uintptr_t)&sync,
.address = 0, /* ignored for sync-only */
.num_batch_buffer = 0, /* sync-only */
};
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260122214053.3189366-2-shuicheng.lin@intel.com
The VRAM/stolen memory managers do not currently set
DRM_BUDDY_CONTIGUOUS_ALLOCATION for contiguous allocations. Enabling
this flag activates the buddy allocator's try_harder path, which helps
handle fragmented memory scenarios.
This enables the __alloc_contig_try_harder fallback in the buddy
allocator, allowing contiguous allocation requests to succeed even when
memory is fragmented by combining allocations from both(RHS and LHS)
sides of a large free block.
v2: (Matt B)
- Remove redundant logic for rounding allocation size and trimming when
TTM_PL_FLAG_CONTIGUOUS is set, since drm_buddy now handles this when
DRM_BUDDY_CONTIGUOUS_ALLOCATION is enabled
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6713
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20260121111416.3104399-2-sanjay.kumar.yadav@intel.com
Currently this is very broken if someone attempts to create a bind
queue and share it across multiple VMs. For example currently we assume
it is safe to acquire the user VM lock to protect some of the bind queue
state, but if allow sharing the bind queue with multiple VMs then this
quickly breaks down.
To fix this reject using a bind queue with any VM that is not the same
VM that was originally passed when creating the bind queue. This a uAPI
change, however this was more of an oversight on kernel side that we
didn't reject this, and expectation is that userspace shouldn't be using
bind queues in this way, so in theory this change should go unnoticed.
Based on a patch from Matt Brost.
v2 (Matt B):
- Hold the vm lock over queue create, to ensure it can't be closed as
we attach the user_vm to the queue.
- Make sure we actually check for NULL user_vm in destruction path.
v3:
- Fix error path handling.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Carl Zhang <carl.zhang@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Acked-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Arvind Yadav <arvind.yadav@intel.com>
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
Link: https://patch.msgid.link/20260120110609.77958-3-matthew.auld@intel.com
Introduce context-based invalidation support to the GuC TLB invalidation
backend. This is implemented by iterating over each exec queue per GT
within a VM, skipping inactive queues, and issuing a context-based (GuC
ID) H2G TLB invalidation. All H2G messages, except the final one, are
sent with an invalid seqno, which the G2H handler drops to ensure the
TLB invalidation fence is only signaled once all H2G messages are
completed.
A watermark mechanism is also added to switch between context-based TLB
invalidations and full device-wide invalidations, as the return on
investment for context-based invalidation diminishes when many exec
queues are mapped.
v2:
- Fix checkpatch warnings
v3:
- Rebase on PRL
- Use ref counting to avoid racing with deregisters
v4:
- Extra braces (Stuart)
- Use per GT list (CI)
- Reorder put
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Tested-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://patch.msgid.link/20260116221731.868657-12-matthew.brost@intel.com
In multi-queue mode, we only have control over the entire group, so we
cannot ban individual queues or signal fences until the whole group is
removed from hardware. Implement banning of the entire group if any job
within it times out.
v2:
- Fix lock inversion (Niranjana)
- Initialize new queues in group to stopped
v3:
- Blindly call xe_exec_queue_multi_queue_primary (Niranjana)
- More comments around temporary list when stopping (Niranjana)
- Restart group on false timeout (Niranjana)
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Link: https://patch.msgid.link/20260116220333.861850-2-matthew.brost@intel.com
clangd reports many "unused header" warnings throughout the Xe driver.
Start working to clean this up by removing unnecessary includes in our
.c files and/or replacing them with explicit includes of other headers
that were previously being included indirectly.
By far the most common offender here was unnecessary inclusion of
xe_gt.h. That likely originates from the early days of xe.ko when
xe_mmio did not exist and all register accesses, including those
unrelated to GTs, were done with GT functions.
There's still a lot of additional #include cleanup that can be done in
the headers themselves; that will come as a followup series.
v2:
- Squash the 79-patch series down to a single patch. (MattB)
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260115032803.4067824-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
All MERT catastrophic errors but VF's LMTT fault are serious, so
we shouldn't limit our handling only to print debug messages.
Change CATERR message to error level and then declare the device
as wedged to match expectation from the design document. For the
LMTT faults, add a note about adding tracking of this unexpected
VF activity.
While at it, rename register fields defnitions to match the BSpec.
Also drop trailing include guard name from the regs.h file.
BSpec: 74625
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Lukasz Laguna <lukasz.laguna@intel.com>
Reviewed-by: Lukasz Laguna <lukasz.laguna@intel.com>
Link: https://patch.msgid.link/20260112183716.28700-1-michal.wajdeczko@intel.com
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
system_wq -> system_percpu_wq
system_unbound_wq -> system_dfl_wq
This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.
After a carefully evaluation, because this is the fence signaling path, we
changed the code in order to use one of the Xe's workqueue.
So, a new workqueue named 'timeout_wq' has been added to
'struct xe_tlb_inval' and has been initialized with 'gt->ordered_wq'
changing the system_wq uses with tlb_inval->timeout_wq.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260112094406.82641-1-marco.crivellari@suse.com
Expose individual VRAM temperature attributes.
Update Xe hwmon documentation for this entry.
v2:
- Avoid using default switch case for VRAM individual temperatures.
- Append labels with VRAM channel number.
- Update kernel version in Xe hwmon documentation.
v3:
- Add missing brackets in Xe hwmon documentation from VRAM channel sysfs.
- Reorder BMG_VRAM_TEMPERATURE_N macro in xe_pcode_regs.h.
- Add api to check if VRAM is available on the channel.
v4:
- Improve VRAM label handling to eliminate temp variable by
introducing a dedicated array vram_label in xe_hwmon_thermal_info.
- Remove a magic number.
- Change the label from vram_X to vram_ch_X.
v5:
- Address review comments from Raag.
- Change vram to VRAM in commit title and subject.
- Refactor BMG_VRAM_TEMPERATURE_N macro.
- Refactor is_vram_ch_available().
- Rephrase a comment.
- Check individual VRAM temperature limits in addition to VRAM
availability in xe_hwmon_temp_is_visible. (Raag)
- Move VRAM label change out of this patch.
v6:
- Use in_range() for VRAM_N index check instead of if check. (Raag)
- Minor aesthetic changes.
Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://patch.msgid.link/20260112203521.1014388-5-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Expose GPU PCIe average temperature and its limits via hwmon sysfs entry
temp5_xxx.
Update Xe hwmon sysfs documentation for this.
v2: Update kernel version in Xe hwmon documentation. (Raag)
v3:
- Address review comments from Raag.
- Remove redundant debug log.
- Update kernel version in Xe hwmon documentation. (Raag)
v4:
- Address review comments from Raag.
- Group new temperature attributes with existing temperature attributes
as per channel index in Xe hwmon documentation.
- Use TEMP_MASK instead of TEMP_MASK_MAILBOX.
- Add PCIE_SENSOR_MASK which uses REG_FIELD_GET as replacement of
PCIE_SENSOR_SHIFT.
v5:
- Address review comments from Raag.
- Use REG_FIELD_GET to get PCIe temperature.
- Move PCIE_SENSOR_GROUP_ID and PCIE_SENSOR_MASK to xe_pcode_api.h
- Cosmetic change.
Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://patch.msgid.link/20260112203521.1014388-4-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Expose GPU memory controller average temperature and its limits under
temp4_xxx.
Update Xe hwmon documentation for this.
v2:
- Rephrase commit message. (Badal)
- Update kernel version in Xe hwmon documentation. (Raag)
v3:
- Update kernel version in Xe hwmon documentation.
- Address review comments from Raag.
- Remove obvious comments.
- Remove redundant debug logs.
- Remove unnecessary checks.
- Avoid magic numbers.
- Add new comments.
- Use temperature sensors count to make memory controller visible.
- Use temperature limits of package for memory controller.
v4:
- Address review comments from Raag.
- Group new temperature attributes with existing temperature attributes
as per channel index in Xe hwmon documentation.
- Use DIV_ROUND_UP to calculate dwords needed for temperature limits.
- Minor aesthetic refinements.
- Remove unused TEMP_MASK_MAILBOX.
v5:
- Use REG_FIELD_GET to get count from READ_THERMAL_DATA output. (Raag)
- Change count print from decimal to hexadecimal.
- Cosmetic changes.
Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://patch.msgid.link/20260112203521.1014388-3-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Read temperature limits using pcode mailbox and expose shutdown
temperature limit as tempX_emergency, critical temperature limit as
tempX_crit and GPU max temperature limit as temp2_max.
Update Xe hwmon documentation with above entries.
v2:
- Resolve a documentation warning.
- Address below review comments from Raag.
- Update date and kernel version in Xe hwmon documentation.
- Remove explicit disable of has_mbx_thermal_info for unsupported
platforms.
- Remove unnecessary default case in switches.
- Remove obvious comments.
- Use TEMP_LIMIT_MAX to compute number of dwords needed in
xe_hwmon_thermal_info.
- Remove THERMAL_LIMITS_DWORDS macro.
- Use has_mbx_thermal_info for checking thermal mailbox support.
v3:
- Address below minor comments. (Raag)
- Group new temperature attributes with existing temperature attributes
as per channel index in Xe hwmon documentation.
- Rename enums of xe_temp_limit to improve clarity.
- Use DIV_ROUND_UP to calculate dwords needed for temperature limits.
- Use return instead of breaks in xe_hwmon_temp_read.
- Minor aesthetic refinements.
v4:
- Remove a redundant break. (Raag)
- Update drm_dbg to drm_warn to inform user of unavailability for
thermal mailbox on expected platforms.
Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://patch.msgid.link/20260112203521.1014388-2-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
On newer platforms GSC FW is only required for content protection
features, so the core driver features work perfectly fine without it
(and we did in fact not enable it to start with on PTL). Therefore, we
can selectively enable the GSC only if the FW is found on disk, without
failing if it is not found.
Note that this means that the FW can now be enabled (i.e., we're looking
for it) but not available (i.e., we haven't found it), so checks on FW
support should use the latter state to decide whether to go on or not.
As part of the rework, the message for FW not found has been cleaned up
to be more readable.
While at it, drop the comment about xe_uc_fw_init() since the code has
been reworked and the statement no longer applies.
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Julia Filipchuk <julia.filipchuk@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://patch.msgid.link/20260108011340.2562349-6-daniele.ceraolospurio@intel.com
Prior to lmem init check, driver is waiting for the pcode uncore_init
status. uncore_init status will be flagged after the complete boot and
initialization of the SoC by the pcode. uncore_init confirms that lmem
init and mmio unblock has been already completed.
It makes no sense to check for lmem init after the pcode uncore_init
check. So change the wait for lmem init check into an assert which
confirms lmem init is set.
Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20251219145024.2955946-2-balasubramani.vivekanandan@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
A careful inspection of __xe_ggtt_insert_bo_at() shows that
the ggtt_node can always be seen as inserted from xe_bo.c
due to the way error handling is performed.
The checks are also a little bit too paranoid, since we
never create a bo with ggtt_node[id] initialised but not
inserted into the GGTT, which can be seen by looking at
__xe_ggtt_insert_bo_at()
Additionally, the size of the GGTT is never bigger than 4 GB,
so adding a check at that level is incorrect.
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260108101014.579906-14-dev@lankhorst.se
We now have proper infrastructure to accurately check the LRC timestamp
without toggling the scheduling state for non-VFs. For VFs, it is still
possible to get an inaccurate view if the context is on hardware. We
guard against free-running contexts on VFs by banning jobs whose
timestamps are not moving. In addition, VFs have a timeslice quantum
that naturally triggers context switches when more than one VF is
running, thus updating the LRC timestamp.
For multi-queue, it is desirable to avoid scheduling toggling in the TDR
because this scheduling state is shared among many queues. Furthermore,
this change simplifies the GuC state machine. The trade-off for VF cases
seems worthwhile.
v5:
- Add xe_lrc_timestamp helper (Umesh)
v6:
- Reduce number of tries on stuck timestamp (VF testing)
- Convert job timestamp save to a memory copy (VF testing)
v7:
- Save ctx timestamp to LRC when start VF job (VF testing)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Link: https://patch.msgid.link/20260110012739.2888434-8-matthew.brost@intel.com
The timestamp WA does not work on a VF because it requires reading MMIO
registers, which are inaccessible on a VF. This timestamp WA confuses
LRC sampling on a VF during TDR, as the LRC timestamp would always read
as 1 for any active context. Disable the timestamp WA on VFs to avoid
this confusion.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Fixes: 617d824c53 ("drm/xe: Add WA BB to capture active context utilization")
Link: https://patch.msgid.link/20260110012739.2888434-7-matthew.brost@intel.com
Deregistering queues in the TDR introduces unnecessary complexity,
requiring reference-counting techniques to function correctly,
particularly to prevent use-after-free (UAF) issues while a
deregistration initiated from the TDR is in progress.
All that's needed in the TDR is to kick the queue off the hardware,
which is achieved by disabling scheduling. Queue deregistration should
be handled in a single, well-defined point in the cleanup path, tied to
the queue's reference count.
v4:
- Explain why extra ref were needed prior to this patch (Niranjana)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Link: https://patch.msgid.link/20260110012739.2888434-5-matthew.brost@intel.com