CCS save/restore batch buffers are attached during BO allocation and
detached during BO teardown. The shrinker triggers xe_bo_move(), which is
used for both allocation and deletion paths.
When BO allocation and shrinking occur concurrently, a circular locking
dependency involving fs_reclaim and swap_guard can occur, leading to a
deadlock such as:
*===============================================================*
* WARNING: possible circular locking dependency detected *
*---------------------------------------------------------------*
* *
* CPU0 CPU1 *
* ---- ---- *
* lock(fs_reclaim); *
* lock(&sa_manager->swap_guard); *
* lock(fs_reclaim); *
* lock(&sa_manager->swap_guard); *
* *
* *** DEADLOCK *** *
*===============================================================*
To avoid this, the BB pointer and SA are allocated using xe_bb_alloc()
before taking lock and SA is initialized using xe_bb_init() preventing
reclaim from being invoked in this context.
Fixes: 864690cf4d ("drm/xe/vf: Attach and detach CCS copy commands with BO")
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Maarten Lankhorst <dev@lankhorst.se>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260220055519.2485681-7-satyanarayana.k.v.p@intel.com
xe_sync_entry_parse() can allocate references (syncobj, fence, chain fence,
or user fence) before hitting a later failure path. Several of those paths
returned directly, leaving partially initialized state and leaking refs.
Route these error paths through a common free_sync label and call
xe_sync_entry_cleanup(sync) before returning the error.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260219233516.2938172-5-shuicheng.lin@intel.com
In case VF's VRAM provisioning using sysfs is done by the admin
prior to VFs enabling, this provisioning will be lost as PF will
run VRAM auto-provisioning anyway. To avoid that skip this auto-
provisioning if any VF has been already provisioned with VRAM.
To help admin find any mistakes, add diagnostics messages about
which VFs were provisioned with VRAM and which were missed.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patch.msgid.link/20260218205553.3561-10-michal.wajdeczko@intel.com
Instead of trying very hard to find the largest fair VRAM (aka LMEM)
size that could be allocated for VFs on the current tile, pick some
smaller rounded down to power-of-two value that is more likely to be
provisioned in the same manner by the other PF instances.
In some cases, the outcome of above calculation might not be optimal,
but it's expected that admin will do fine-tuning using sysfs files.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patch.msgid.link/20260218205553.3561-6-michal.wajdeczko@intel.com
On current discrete platforms, PF will provision all VFs with a fair
amount of the VRAM (LMEM) during VFs enabling. However, in some cases
this automatic VRAM provisioning might be either non-reproducible or
sub-optimal. This could break VF's migration or impact performance.
Expose per-VF VRAM quota read-write sysfs attributes to allow admin
change default VRAM provisioning performed by the PF.
/sys/bus/pci/drivers/xe/BDF/
├── sriov_admin/
├── .bulk_profile
│ └── vram_quota [RW] unsigned integer
├── vf1/
│ └── profile
│ └── vram_quota [RW] unsigned integer
├── vf2/
│ └── profile
│ └── vram_quota [RW] unsigned integer
Above values represent total provisioned VRAM from all tiles where
VFs were assigned, and currently it's from all tiles always.
Note that changing VRAM provisioning is only possible when VF is
not running, otherwise GuC will complain. To make sure that given
VF is idle, triggering VF FLR might be needed.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patch.msgid.link/20260218205553.3561-5-michal.wajdeczko@intel.com
reg_sr programming that applies to an engines LRC cannot be verified by
a simple CPU-based register readout because the reg_sr's values may not
be in effect if no context is executing on the hardware at the time we
check. Instead, we should verify correct reg_sr application by
searching for the register in the default_lrc.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/20260218-sr_verify-v4-4-35d6deeb3421@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
An LRC is stored in memory as a special batchbuffer that hardware will
execute to re-load state when switching to the context; it's a
collection of register values (encoded as MI_LOAD_REGISTER_IMM commands)
and other state instructions (e.g., 3DSTATE_*). The value that will be
loaded for a given register can be determined by parsing the batchbuffer
to find MI_LRI commands and extracting the value from the offset/value
pairs it contains. Add functions to do this, which will be used in a
future patch to help verify that our expected reg_sr programming is in
place.
The implementation here returns the value as soon as it finds a match in
the LRC. Technically a register could appear multiple times (either due
to memory corruption or a hardware defect) and the last value
encountered would be the one in effect when the context resumes
execution. We can adjust the logic to keep looking and return the last
match instead of first in the future if we encounter real-world cases
where this would assist with debugging.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/20260218-sr_verify-v4-3-35d6deeb3421@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
When applying save-restore register programming for workarounds, tuning
settings, and general device configuration we assume the programming was
successful. However there are a number of cases where the desired
reg_sr programming can become lost:
- workarounds implemented on the wrong RTP table might not get
saved/restored at the right time leading to, for example, failure to
re-apply the programming after engine resets
- some hardware registers become "locked" and can no longer be updated
after firmware or the driver finishes initializing them
- sometimes the hardware teams just made a mistake when documenting the
register and/or bits that needed to be programmed
Add a debugfs entry that will read back the registers referenced on a
GT's save-restore lists and print any cases where the desired
programming is no longer in effect. Such cases might indicate the
presence of a driver/firmware bug, might indicate that the documentation
we were following has a mistake, or might be benign (occasionally
registers have broken read-back capability preventing verification, but
previous writes were still successful and effective).
For now we only verify the GT and engine reg_sr lists. Verifying the
LRC list will require checking the expected programming against the
default_lrc contents, not the live registers (which may not reflect the
reg_sr programming if no context is actively running).
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/20260218-sr_verify-v4-2-35d6deeb3421@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
There are a few different reg_sr lists managed by the driver for
workarounds/tuning:
- gt->reg_sr
- hwe->reg_sr
- hwe->reg_lrc
The first two are not relevant to SRIOV VFs; a VF KMD does not have
access to the registers that appear on this list and it is the PF KMD's
responsibility to apply such programming on behalf of the entire system.
However the third list contains per-client values that the VF KMD needs
to ensure are incorporated whenever a new LRC is created.
Handling of reg_sr lists comes in two steps: processing an RTP table to
build a reg_sr from the relevant entries, and then applying the contents
of the reg_sr. Skipping the RTP processing (resulting in an empty
reg_sr) or skipping the application of a reg_sr are both valid ways to
avoid having a VF accidentally try to write registers it doesn't have
access to. In commit c19e705ec9 ("drm/xe/vf: Stop applying
save-restore MMIOs if VF") and commit 92a5bd3024 ("drm/xe/vf: Unblock
xe_rtp_process_to_sr for VFs") we adjusted the drivers behavior to
always process the RTP table into a reg_sr and just skipped the
application step. This works fine functionally, but can lead to
confusion during debugging since facilities like the debugfs
'register-save-restore' will still report a bunch of registers that the
VF KMD isn't actually trying to handle. It will also mislead other
upcoming debug changes.
Let's go back to skipping the RTP => reg_sr processing step, but only
for GT / hwe tables this time. This will allow LRC reg_sr handling to
continue to work, but will ensure that gt->reg_sr and hwe->reg_sr remain
empty and that debugfs reporting more accurately reflects the KMD's
behavior.
v2:
- Also skip the hwe processing in hw_engine_setup_default_state() and
xe_reg_whitelist_process_engine().
v3:
- Handle skipping via an additional parameter passed to
xe_rtp_process_to_sr() rather than adding conditions at each
callsite. (Ashutosh)
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Harish Chegondi <harish.chegondi@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/20260218-sr_verify-v4-1-35d6deeb3421@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
When generating the default LRC, if a register is not masked, we apply
any save-restore programming necessary via a read-modify-write sequence
that will ensure we only update the relevant bits/fields without
clobbering the rest of the register. However some of the registers that
need to be updated might be MCR registers which require steering to a
non-terminated instance to ensure we can read back a valid, non-zero
value. The steering of reads originating from a command streamer is
controlled by register CS_MMIO_GROUP_INSTANCE_SELECT. Emit additional
MI_LRI commands to update the steering before any RMW of an MCR register
to ensure the reads are performed properly.
Note that needing to perform a RMW of an MCR register while building the
default LRC is pretty rare. Most of the MCR registers that are part of
an engine's LRCs are also masked registers, so no MCR is necessary.
Fixes: f2f90989cc ("drm/xe: Avoid reading RMW registers in emit_wa_job")
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Link: https://patch.msgid.link/20260206223058.387014-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Current GT statistics use atomic64_t counters. Atomic operations incur
a global coherency penalty.
Transition to dynamic per-cpu counters using alloc_percpu(). This allows
stats to be incremented via this_cpu_add(), which compiles to a single
non-locking instruction. This approach keeps the hot-path updates local
to the CPU, avoiding expensive cross-core cache invalidation traffic.
Use for_each_possible_cpu() during aggregation and clear operations to
ensure data consistency across CPU hotplug events.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://patch.msgid.link/20260217200552.596718-1-matthew.brost@intel.com
Remove the unnecessary VRAM channel entry introduced in xe_hwmon_channel.
Without this, adding any new hwmon channel causes extra VRAM channel
to appear. This remained unnoticed earlier because VRAM was the
final xe hwmon channel.
v2: Use MAX_VRAM_CHANNELS with in_range() instead of
CHANNEL_VRAM_N_MAX. (Raag)
Fixes: 49a4983384 ("drm/xe/hwmon: Expose individual VRAM channel temperature")
Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://patch.msgid.link/20260206081655.2115439-1-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Passing a structure by value into a function is sometimes problematic,
for a number of reasons. Of of these is a warning from the 32-bit arm
compiler:
drivers/gpu/drm/drm_gpusvm.c: In function '__drm_gpusvm_unmap_pages':
drivers/gpu/drm/drm_gpusvm.c:1152:33: note: parameter passing for argument of type 'struct drm_pagemap_addr' changed in GCC 9.1
1152 | dpagemap->ops->device_unmap(dpagemap,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1153 | dev, *addr);
| ~~~~~~~~~~~
This particular problem is harmless since we are not mixing compiler versions
inside of the compiler. However, passing this by reference avoids the warning
along with providing slightly better calling conventions as it avoids an
extra copy on the stack.
Fixes: 75af93b3f5 ("drm/pagemap, drm/xe: Support destination migration over interconnect")
Fixes: 2df55d9e66 ("drm/xe: Support pcie p2p dma as a fast interconnect")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20260216134644.1025365-1-arnd@kernel.org
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
With the intermediate state gone, no longer useful. Just check against
NULL where needed.
After looking carefully, the check for allocated in xe_fb_pin.c is
unneeded. vma->node is never NULL. The check is specifically only
to check if vma->node == the bo's root tile ggtt_obj.
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20260206112108.1453809-12-dev@lankhorst.se
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
The previous code was using a complicated system with 2 balloons to
set GGTT size and adjust GGTT offset. While it works, it's overly
complicated.
A better approach is to set the offset and size when initializing GGTT,
this removes the need for adding balloons. The resize function only
needs readjust ggtt->start to have GGTT at the new offset.
This removes the need to manipulate the internals of xe_ggtt outside
of xe_ggtt, and cleans up a lot of now unneeded code.
Co-developed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20260206112108.1453809-9-dev@lankhorst.se
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Fix all functions that use node->start to use xe_ggtt_node_addr,
and add ggtt->start to node->start.
This will make node shifting for SR-IOV VF a one-liner, instead of
manually changing each GGTT node's base address.
Also convert some uses of mutex_lock/unlock to mutex guards.
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20260206112108.1453809-8-dev@lankhorst.se
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
The GuC pagefault acknowledgment code is designed to extract the fields
needed for the acknowledgment from the producer-stored message so that
the consumer fields can be overloaded to return additional information.
The ASID is stored in the producer message; extract it from there to
future‑proof this logic.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://patch.msgid.link/20260212204227.2764054-3-matthew.brost@intel.com
Wa_16018737384 is one of the rare cases where the hardware teams mark a
workaround as "driver change required" rather than "permanent/temporary
workaround" in the internal workaround database, signifying that the
implementation details of the workaround should just be considered
standard programming instructions on all platforms going forward. Cases
like this are the only time that using XE_RTP_END_VERSION_UNDEFINED as an
upper bound for a workaround's IP range is warranted and correct.
However in this specific case, the register bit in question (0xE4F0[1])
simply no longer exists in hardware from Xe3 onward. Trying to write to
that bit on Xe3 or Xe3p platforms is harmless and just doesn't have any
effect, but it's possible that the register bit could get repurposed to
control something else down the road on future platforms. To avoid any
surprises in the future we should replace the unbounded upper bound in
our RTP table with a value that accurately reflects that Wa_16018737384
can only apply to Xe2 platforms.
Bspec: 56849
Reviewed-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com>
Link: https://patch.msgid.link/20260211234735.620087-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
SRIOV VFs do not automatically have access to the XeCore fuse registers.
Add the two new registers that show up on Xe3p_XPC to the runtime
register list to grant VFs access. Since there's a single runtime
register list for all Xe3p, this will technically also grant access on
Xe3p_LPG platforms where the registers don't exist, but that should be
harmless since even if a VF tries to read a non-existent register on
those platforms it will just get back a sensible value of 0x0.
Fixes: e8100643ff ("drm/xe/xe3p_xpc: XeCore mask spans four registers")
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Ngai-Mint Kwan <ngai-mint.kwan@linux.intel.com>
Link: https://patch.msgid.link/20260210182519.206952-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
When the media GT is not allowed, a VF must not attempt to read
the media version from the GuC. The GuC may not be loaded, and
any attempt to communicate with it would result in a timeout
and a VF probe failure:
(...)
[ 1912.406046] xe 0000:01:00.1: [drm] *ERROR* Tile0: GT1: GuC mmio request 0x5507: no reply 0x5507
[ 1912.407277] xe 0000:01:00.1: [drm] *ERROR* Tile0: GT1: [GUC COMMUNICATION] MMIO send failed (-ETIMEDOUT)
[ 1912.408689] xe 0000:01:00.1: [drm] *ERROR* VF: Tile0: GT1: Failed to reset GuC state (-ETIMEDOUT)
[ 1912.413986] xe 0000:01:00.1: probe with driver xe failed with error -110
Let's skip reading the media version for VFs when the media GT is not
allowed.
v2: move the condition directly to the VF path
Fixes: 7abd69278b ("drm/xe/configfs: Add attribute to disable GT types")
Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://patch.msgid.link/20260202115041.2863357-1-piotr.piorkowski@intel.com
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
There is a unbalanced lock/unlock to gpusvm notifier lock:
[ 931.045868] =====================================
[ 931.046509] WARNING: bad unlock balance detected!
[ 931.047149] 6.19.0-rc6+xe-**************** #9 Tainted: G U
[ 931.048150] -------------------------------------
[ 931.048790] kworker/u5:0/51 is trying to release lock (&gpusvm->notifier_lock) at:
[ 931.049801] [<ffffffffa090c0d8>] drm_gpusvm_scan_mm+0x188/0x460 [drm_gpusvm_helper]
[ 931.050802] but there are no more locks to release!
[ 931.051463]
The drm_gpusvm_notifier_unlock() sits under err_free label and the
first jump to err_free is just before calling the
drm_gpusvm_notifier_lock() causing unbalanced unlock.
Fixes: f1d08a5864 ("drm/gpusvm: Introduce a function to scan the current migration state")
Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260209123433.1271053-1-maciej.patelczyk@intel.com
The PSS_CHICKEN register has been part of the RCS engine's LRC since it
was first introduced in Xe_LP. That means that any workarounds that
adjust its value (such as Wa_14019988906 and Wa_14019877138) need to be
implemented in the lrc_was[] table so that they become part of the
default LRC from which all subsequent LRCs are copied. Although these
workarounds were implemented correctly on most platforms, they were
incorrectly placed on the engine_was[] table for Xe2_HPG.
Move the workarounds to the proper lrc_was[] table and switch the
'xe_rtp_match_first_render_or_compute' rule to specifically match the
RCS since that's the engine whose LRC manages the register.
Bspec: 65182
Fixes: 7f3ee7d880 ("drm/xe/xe2hpg: Add initial GT workarounds")
Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com>
Link: https://patch.msgid.link/20260205220508.51905-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
On NVL-P, the primary GT's WOPCM gained an extra 8MiB for the Memory
URB. As such, we need to bump the maximum size in the driver so that
the driver is able to load without erroring out thinking that the WOPCM
is too small.
FIXME: The wopcm code in xe driver is a bit confusing. For the case
where the offsets for GUC WOPCM are already locked, it appears we are
using the maximum overall WOPCM size instead of the sizes relative to
each type of GT. The function __check_layout() should be checking
against the latter.
Bspec: 67090
Reviewed-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com>
Link: https://patch.msgid.link/20260206-nvl-p-upstreaming-v3-15-636e1ad32688@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>