Commit Graph

1295319 Commits

Author SHA1 Message Date
Maarten Lankhorst
501d799a47 drm/xe: Wire up device shutdown handler
The system is turning off, and we should probably put the device
in a safe power state. We don't need to evict VRAM or suspend running
jobs to a safe state, as the device is rebooted anyway.

This does not imply the system is necessarily reset, as we can
kexec into a new kernel. Without shutting down, things like
USB Type-C may mysteriously start failing.

References: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/3500
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
[mlankhorst: Add !xe_driver_flr_disabled assert]
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240905150052.174895-4-maarten.lankhorst@linux.intel.com
2024-09-11 19:07:57 +02:00
Maarten Lankhorst
f90491d4b6 drm/xe: Remove runtime argument from display s/r functions
The previous change ensures that pm_suspend is only called when
suspending or resuming. This ensures no further bugs like those
in the previous commit.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240905150052.174895-3-maarten.lankhorst@linux.intel.com
2024-09-11 18:31:13 +02:00
Maarten Lankhorst
474f64cb98 drm/xe: Fix missing conversion to xe_display_pm_runtime_resume
This error path was missed when converting away from
xe_display_pm_resume with second argument.

Fixes: 66a0f6b9f5 ("drm/xe/display: handle HPD polling in display runtime suspend/resume")
Cc: Arun R Murthy <arun.r.murthy@intel.com>
Cc: Vinod Govindapillai <vinod.govindapillai@intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240905150052.174895-2-maarten.lankhorst@linux.intel.com
2024-09-11 18:31:09 +02:00
Tejas Upadhyay
9db969b36b drm/xe/xe2hpg: Add Wa_15016589081
Wa_15016589081 applies to xe2_hpg renderCS

V2(Gustavo)
  - rename bit macro

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240904101333.2049655-1-tejas.upadhyay@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-09-11 13:09:00 +02:00
Thomas Hellström
5a90b60db5 drm/xe: Add a xe_bo subtest for shrinking / swapping
Add a subtest that tries to allocate twice the amount of
buffer object memory available, write data to it and then read
all the data back verifying data integrity.
In order to be able to do this on systems that
have no or not enough swap-space available, allocate some memory
as purgeable, and introduce a function to purge such memory from
the TTM swap_notify path.

this test is intended to add test coverage to the current
bo swap path and upcoming shrinking path.

The test has previously been part of the xe bo shrinker series.

v2:
- Skip test if the execution time is expected to be too long.
- Minor code cleanups.

v3:
- Print random seed. (Matthew Auld)

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240909085654.5064-1-thomas.hellstrom@linux.intel.com
2024-09-11 09:11:33 +02:00
Arnd Bergmann
1c129ed07d drm/xe: fix build warning with CONFIG_PM=n
The 'runtime_status' field is an implementation detail of the
power management code, so a device driver should not normally
touch this:

drivers/gpu/drm/xe/xe_pm.c: In function 'xe_pm_suspending_or_resuming':
drivers/gpu/drm/xe/xe_pm.c:606:26: error: 'struct dev_pm_info' has no member named 'runtime_status'
  606 |         return dev->power.runtime_status == RPM_SUSPENDING ||
      |                          ^
drivers/gpu/drm/xe/xe_pm.c:607:27: error: 'struct dev_pm_info' has no member named 'runtime_status'
  607 |                 dev->power.runtime_status == RPM_RESUMING;
      |                           ^
drivers/gpu/drm/xe/xe_pm.c:608:1: error: control reaches end of non-void function [-Werror=return-type]

Add an #ifdef check to avoid the build regression.

Fixes: cb85e39dc5 ("drm/xe: Suppress missing outer rpm protection warning")
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20240909202521.1018439-1-arnd@kernel.org
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-09 16:49:18 -04:00
Michal Wajdeczko
f2710d9572 drm/xe: Don't keep stale pointer to bo->ggtt_node
When we fail to map a BO in the GGTT, we release our GGTT node
placeholder, but leave stale bo->ggtt_node pointer to it, which
triggers an assert immediately followed by a crash, due to UAF:

[ ] xe 0000:00:02.0: [drm] Assertion `bo->ggtt_node->base.size == bo->size` failed!
[ ] WARNING: CPU: 4 PID: 126 at drivers/gpu/drm/xe/xe_ggtt.c:689 xe_ggtt_remove_bo+0x1d9/0x250 [xe]
[ ] RIP: 0010:xe_ggtt_remove_bo+0x1d9/0x250 [xe]
[ ] Call Trace:
[ ]  <TASK>
[ ]  ? __warn+0x88/0x190
[ ]  ? xe_ggtt_remove_bo+0x1d9/0x250 [xe]
[ ]  ? report_bug+0x1c3/0x1d0
[ ]  ? handle_bug+0x42/0x70
[ ]  ? exc_invalid_op+0x14/0x70
[ ]  ? asm_exc_invalid_op+0x16/0x20
[ ]  ? xe_ggtt_remove_bo+0x1d9/0x250 [xe]
[ ]  ? xe_ggtt_remove_bo+0x1d9/0x250 [xe]
[ ]  xe_ttm_bo_destroy+0x11f/0x260 [xe]
[ ]  ? ttm_bo_release+0x31c/0x350 [ttm]
[ ]  ? __mutex_unlock_slowpath+0x35/0x270
[ ]  __xe_bo_create_locked+0x4a0/0x550 [xe]
[ ]  ? mark_held_locks+0x49/0x80
[ ]  xe_bo_create_pin_map_at+0x37/0x200 [xe]
[ ]  xe_bo_create_pin_map+0x11/0x20 [xe]

While around, for similar reason, also don't keep an error pointer
if we fail to allocate ggtt_node placeholder.

Fixes: 34e804220f ("drm/xe: Make xe_ggtt_node struct independent")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240906220348.1836-1-michal.wajdeczko@intel.com
2024-09-09 20:08:55 +02:00
Lucas De Marchi
3fe62f7bfd drm/xe: Mark reserved engines in snapshot
When printing <debufs>/gt*/hw_engines, it's useful to mark
what engines are reserved so it doesn't mislead developers
while debugging.

Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240906205609.3131330-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-09-09 10:33:14 -07:00
Lucas De Marchi
ceb29504dd drm/xe: Fix arg to pci_iomap()
Commit 2d8865b277 ("drm/xe: Move BAR definitions to dedicated file")
moved the BAR definition to the header, but replaced the wrong arg in
the pci_iomap() function - the last arg is actuall the length, not the
BAR. Luckily GTTMMADR_BAR == 0, so it still works. Fix the argument
to avoid confusion.

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240906032507.2952859-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-09-09 10:32:32 -07:00
Lucas De Marchi
0c841e47d8 drm/xe: Update runtime detection of has_flat_ccs
It's confusing to have a *set* function that actually probes the
hardware rather than receiving a parameter. Rename it to *probe* along
with prefix removal and comment in the relevant places that the
has_flat_ccs flag may be overridden in runtime.

While at it, fix the mixed declaration of struct xe_gt.

Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240904162238.2831202-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-09-06 10:19:42 -07:00
Lucas De Marchi
b43723f864 drm/xe: Cleanup has_flat_ccs handling
The flag is set in XE_HP_FEATURES, but then overridden in all but one
xe_graphics_desc. Make it set only where needed.

Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240904162238.2831202-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-09-06 10:19:36 -07:00
Dafna Hirschfeld
249df8cbec drm/xe: fix missing 'xe_vm_put'
Fix memleak caused by missing xe_vm_put

Fixes: 852856e3b6 ("drm/xe: Use reserved copy engine for user binds on faulting devices")
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240901044227.1177211-1-dhirschfeld@habana.ai
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-06 12:33:29 -04:00
Rodrigo Vivi
cb85e39dc5 drm/xe: Suppress missing outer rpm protection warning
Do not raise a WARN if we are likely within suspending or resuming
path. This is likely this false positive:

rpm_status:           0000:03:00.0 status=RPM_SUSPENDING
console:              xe_bo_evict_all (called from suspend)
xe_sched_job_create:  dev=0000:03:00.0, ...
xe_sched_job_exec:    dev=0000:03:00.0, ...
xe_pm_runtime_put:    dev=0000:03:00.0, ...
xe_sched_job_run:     dev=0000:03:00.0, ...
rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
console:              xe 0000:03:00.0: [drm] Missing outer runtime
                                                     PM protection
console:               xe_guc_ct_send+0x15/0x50 [xe]
console:               guc_exec_queue_run_job+0x1509/0x3950 [xe]
[snip]
console:               drm_sched_run_job_work+0x649/0xc20

At this point, BOs are getting evicted from VRAM with rpm
usage-counter = 2, but rpm status = SUSPENDING.

The xe->pm_callback_task won't be equal 'current' because this call is
coming from a work queue.

So, pm_runtime_get_if_active() will be called and return 0 because rpm
status != ACTIVE (but equal SUSPENDING or RESUMING).

v2: Still get the reference even on non suspending/resuming
    path (Jonathan, Brost).

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240905140215.56404-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-06 12:15:01 -04:00
Riana Tauro
0914c1e45d drm/xe/xe_gt_idle: add debugfs entry for powergating info
Coarse Powergating is a power saving technique where Render and Media
can be power-gated independently irrespective of the rest of the GT.

For debug purposes, it is useful to expose the powergating information.

v2: move to debugfs
    add details to commit message
    add per-slice status for media
    define reg bits in descending order (Matt Roper)

v3: fix return statement
    fix kernel-doc
    use loop for media slices
    use helper function for status (Michal)

v4: add pg prefix
    do not wake GT if in C6 (Badal)

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240906071126.28078-3-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-06 12:13:30 -04:00
Riana Tauro
c2bf07dd0b drm/xe/xe_gt_idle: modify powergate enable condition
Modify powergate enable condition based on the type of GT or presence of
media engines. Also have a copy of the value written to powergate enable
register.

v2: add condition to enable render or media powergating (Badal)

v3: fix commit message (Shekhar)
    fix kernel-doc

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240906071126.28078-2-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-06 12:13:30 -04:00
Jani Nikula
5ea28f921a drm/xe: use IS_ENABLED() instead of defined() on config options
Prefer IS_ENABLED() instead of defined() for checking whether a kconfig
option is enabled.

Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240904145231.3902289-1-jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-09-06 16:32:49 +03:00
Jani Nikula
cdb56a63f7 drm/xe/pciids: separate ARL and MTL PCI IDs
Avoid including PCI IDs for one platform to the PCI IDs of another. It's
more clear to deal with them completely separately at the PCI ID macro
level.

Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/a30cb0da7694a8eccceba66d676ac59aa0e96176.1725443121.git.jani.nikula@intel.com
2024-09-06 15:40:05 +03:00
Jani Nikula
d454902a69 drm/xe/pciids: separate RPL-U and RPL-P PCI IDs
Avoid including PCI IDs for one platform to the PCI IDs of another. It's
more clear to deal with them completely separately at the PCI ID macro
level.

Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/4868d36fbfa8c38ea2d490bca82cf6370b8d65dd.1725443121.git.jani.nikula@intel.com
2024-09-06 15:39:55 +03:00
Jani Nikula
ae304b0545 drm/xe/pciids: add some missing ADL-N PCI IDs
Similar to commit 425b463859 ("drm/i915: Update ADL-N PCI IDs").

Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/47d543393e4026588401a03c4e3ce12ce29780e3.1725443121.git.jani.nikula@intel.com
2024-09-06 15:39:20 +03:00
Matthew Auld
1ff14648dc drm/xe/pat: sanity check compression and coh_mode
There is an implicit assumption in the driver that compression and
coh_1way+ are mutually exclusive. If this is ever not true then userptr
and imported dma-buf from external device will have uncleared ccs state.
Add a build bug for this so we don't forget.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828092257.169063-2-matthew.auld@intel.com
2024-09-06 09:35:28 +01:00
Matthew Auld
89076b5a8b drm/xe: prevent potential UAF in pf_provision_vf_ggtt()
The node ptr can point to an already freed ptr, if we hit the path with
an already allocated node. We later dereference that pointer with:

	xe_gt_assert(gt, !xe_ggtt_node_allocated(node));

which is a potential UAF. Fix this by not stashing the ptr for node.
Also since it is likely a bad idea to leave config->ggtt_region pointing
to a stale ptr, also set that to NULL by calling
pf_release_vf_config_ggtt() instead of pf_release_ggtt().

Fixes: 34e804220f ("drm/xe: Make xe_ggtt_node struct independent")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828104341.180111-2-matthew.auld@intel.com
2024-09-06 09:35:05 +01:00
Nitin Gote
cd89de14bb drm/xe: Replace double space with single space after comma
Avoid using double space, ",  " in function or macro parameters
where it's not required by any alignment purpose. Replace it with
a single space, ", ".

Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240823080643.2461992-1-nitin.r.gote@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-09-05 18:20:00 +02:00
Michal Wajdeczko
13a48a0fa5 drm/xe/pf: Sanitize VF scratch registers on FLR
Some VF accessible registers (like GuC scratch registers) must be
explicitly reset during the FLR. While this is today done by the GuC
firmware, according to the design, this should be responsibility of
the PF driver, as future platforms may require more registers to be
reset. Likewise GuC, the PF can access VFs registers by adding some
platform specific offset to the original register address.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240902192953.1792-1-michal.wajdeczko@intel.com
2024-09-05 18:09:24 +02:00
Thomas Hellström
34bb7b813a drm/xe: Use xe_pm_runtime_get in xe_bo_move() if reclaim-safe.
xe_bo_move() might be called in the TTM swapout path from validation
by another TTM device. If so, we are not likely to have a RPM
reference. So iff xe_pm_runtime_get() is safe to call from reclaim,
use it instead of xe_pm_runtime_get_noresume().

Strictly this is currently needed only if handle_system_ccs is true,
but use xe_pm_runtime_get() if possible anyway to increase test
coverage.

At the same time warn if handle_system_ccs is true and we can't
call xe_pm_runtime_get() from reclaim context. This will likely trip
if someone tries to enable SRIOV on LNL, without fixing Xe SRIOV
runtime resume / suspend.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240903094232.166342-1-thomas.hellstrom@linux.intel.com
2024-09-04 09:28:09 +02:00
Rodrigo Vivi
8da19441d0 drm/xe/display: Avoid encoder_suspend at runtime suspend
Fix circular locking dependency on runtime suspend.

<4> [74.952215] ======================================================
<4> [74.952217] WARNING: possible circular locking dependency detected
<4> [74.952219] 6.10.0-rc7-xe #1 Not tainted
<4> [74.952221] ------------------------------------------------------
<4> [74.952223] kworker/7:1/82 is trying to acquire lock:
<4> [74.952226] ffff888120548488 (&dev->mode_config.mutex){+.+.}-{3:3}, at: drm_modeset_lock_all+0x40/0x1e0 [drm]
<4> [74.952260]
but task is already holding lock:
<4> [74.952262] ffffffffa0ae59c0 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at: xe_pm_runtime_suspend+0x2f/0x340 [xe]
<4> [74.952322]
which lock already depends on the new lock.

The commit 'b1d90a86 ("drm/xe: Use the encoder suspend helper also used
by the i915 driver")' didn't do anything wrong. It actually fixed a
critical bug, because the encoder_suspend was never getting actually
called because it was returning if (has_display(xe)) instead of
if (!has_display(xe)). However, this ended up introducing the encoder
suspend calls in the runtime routines as well, causing the circular
locking dependency.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2304
Fixes: b1d90a862c ("drm/xe: Use the encoder suspend helper also used by the i915 driver")
Cc: Imre Deak <imre.deak@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240830183507.298351-2-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-03 12:47:00 -04:00
Rodrigo Vivi
bc947d9a8c drm/xe: Add missing runtime reference to wedged upon gt_reset
Fixes this missed case:

xe 0000:00:02.0: [drm] Missing outer runtime PM protection
WARNING: CPU: 99 PID: 1455 at drivers/gpu/drm/xe/xe_pm.c:564 xe_pm_runtime_get_noresume+0x48/0x60 [xe]
Call Trace:
<TASK>
? show_regs+0x67/0x70
? __warn+0x94/0x1b0
? xe_pm_runtime_get_noresume+0x48/0x60 [xe]
? report_bug+0x1b7/0x1d0
? handle_bug+0x46/0x80
? exc_invalid_op+0x19/0x70
? asm_exc_invalid_op+0x1b/0x20
? xe_pm_runtime_get_noresume+0x48/0x60 [xe]
xe_device_declare_wedged+0x91/0x280 [xe]
gt_reset_worker+0xa2/0x250 [xe]

v2: Also move get and get the right Fixes tag (Himal, Brost)

Fixes: fb74b205cd ("drm/xe: Introduce a simple wedged state")
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240830183507.298351-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-03 12:46:39 -04:00
Michal Wajdeczko
9f6b47907e drm/xe: Remove redundant [drm] tag from xe_assert() message
Since commit 178c0a33c4 ("drm/print: Add generic drm dev printk
function") the output from drm_WARN() includes previously missing
the [drm] tag, so now xe_assert() is printing it twice:

  [ ] xe 0000:00:02.0: [drm] [drm] Assertion `false` failed!

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240902190726.1748-1-michal.wajdeczko@intel.com
2024-09-03 11:42:42 +02:00
Michal Wajdeczko
da6ec74339 drm/xe/pf: Reset thresholds when releasing a VF config
As part of the VF config release, we should reset all parameters,
including thresholds, to always start with the clean VF config.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240830132100.1704-3-michal.wajdeczko@intel.com
2024-09-02 20:51:03 +02:00
Michal Wajdeczko
a1498ab229 drm/xe/pf: Add thresholds to the VF KLV config
We are pushing threshold KLV to the GuC immediately during the
threshold provisioning, but those configs will be lost during a
GT reset.  Include threshold KLVs while encoding full VF config
buffer to make sure the GuC receives all of the config KLVs.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240830132100.1704-2-michal.wajdeczko@intel.com
2024-09-02 20:51:02 +02:00
Matthew Brost
bf758226c7 drm/xe: Invalidate media_gt TLBs in PT code
Testing on LNL has shown media GT's TLBs need to be invalidated via the
GuC, update PT code appropriately.

v2:
 - Do dma_fence_get before first call of invalidation_fence_init (Himal)
 - No need to check for valid chain fence (Himal)
v3:
 - Use dma-fence-array

Fixes: 3330361543 ("drm/xe/lnl: Add LNL platform definition")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826170144.2492062-3-matthew.brost@intel.com
2024-08-30 11:41:26 -07:00
Matthew Brost
ddc94d0b17 dma-buf: Split out dma fence array create into alloc and arm functions
Useful to preallocate dma fence array and then arm in path of reclaim or
a dma fence.

v2:
 - s/arm/init (Christian)
 - Drop !array warn (Christian)
v3:
 - Fix kernel doc typos (dim)

Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826170144.2492062-2-matthew.brost@intel.com
2024-08-30 11:41:05 -07:00
Matt Roper
20f61c1ead drm/xe/hwmon: Treat hwmon as a per-device concept
There's only one instance of hwmon per device, and MMIO access to it is
always done through the root tile.  The code has been passing around a
pointer to the root tile's primary GT, which is confusing since this
isn't really a GT-level concept.  Replace that pointer with an xe_device
pointer and use xe_root_mmio_gt(xe) to get a pointer when we need to do
register MMIO.  This makes things easier to follow, and also cleans up
the code in preparation for a much larger MMIO register access overhaul
that's coming soon.

Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240829220619.789159-6-matthew.d.roper@intel.com
2024-08-30 08:56:23 -07:00
Matt Roper
3034cc8107 drm/xe/pcode: Treat pcode as per-tile rather than per-GT
There's only one instance of the pcode per tile, and for GT-related
accesses both the primary and media GT share the same register
interface.  Since Xe was using per-GT locking, the pcode mutex wasn't
actually protecting everything that it should since concurrent accesses
related to a tile's primary GT and media GT were possible.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240829220619.789159-5-matthew.d.roper@intel.com
2024-08-30 08:56:20 -07:00
Matt Roper
cad08fa776 drm/xe/display: Drop unnecessary xe_gt.h includes
None of the Xe display files work directly with the GT or need anything
from xe_gt.h.  Drop the unnecessary include.

Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240829230307.886233-2-matthew.d.roper@intel.com
2024-08-30 08:55:55 -07:00
Nirmoy Das
c5f728de69 drm/xe: Fix memory leak on xe_alloc_pf_queue failure
Simplify memory unwinding on error also fixing current memory
leak that can happen on error.

v2: use devm_kcalloc(Matt A)

Fixes: 3338e4f90c ("drm/xe: Use topology to determine page fault queue size")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826162035.20462-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-08-30 11:38:53 +02:00
Michal Wajdeczko
2bd87f0fc2 drm/xe/pf: Improve VF control
Our initial VF control implementation was focused on providing
a very minimal support for the VF_STATE_NOTIFY events just to
meet GuC requirements, without tracking a VF state or doing any
expected actions (like cleanup in case of the FLR notification).

Try to improve this by defining set of VF state machines, each
responsible for processing one activity (PAUSE, RESUME, STOP or
FLR). All required steps defined by the VF state machine are then
executed by the PF worker from the dedicated workqueue.

Any external requests or notifications simply try to transition
between the states to trigger a work and then wait for that work
to finish. Some predefined default timeouts are used to avoid
changing existing API calls, but it should be easy to extend the
control API to also accept specific timeout values.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828210809.1528-5-michal.wajdeczko@intel.com
2024-08-30 10:51:09 +02:00
Michal Wajdeczko
d69300abc2 drm/xe/pf: Drop GuC notifications for non-existing VF
It is unlikely that GuC will ever send a G2H notification with an
invalid VFID and it is currently harmless if that actually happen.
But in upcoming patches we will start using that VFID as an index
and we must be sure it is a valid to avoid a crash due to a buggy
firmware or a currupted G2H message.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828210809.1528-4-michal.wajdeczko@intel.com
2024-08-30 10:51:08 +02:00
Michal Wajdeczko
65fe9617a1 drm/xe/pf: Fix documentation formatting
Current formatting of "The VF FLR Flow with GuC" only looks fine,
but it will not render properly when included in htmldocs due to:

  WARNING: Block quote ends without a blank line; unexpected unindent.
  CRITICAL: Missing matching underline for section title overline.

Fix that by adding proper indent and using list markup.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828210809.1528-3-michal.wajdeczko@intel.com
2024-08-30 10:51:07 +02:00
Michal Wajdeczko
16ba2b28df drm/xe/pf: Add function to sanitize VF resources
On current platforms it is a PF driver responsibility to clear
some of the VF's resources during a VF FLR. Add simple function
that will clear configured VF resources (GGTT, LMEM). We will
start using this function soon.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828210809.1528-2-michal.wajdeczko@intel.com
2024-08-30 10:51:06 +02:00
Daniele Ceraolo Spurio
02a416afbe drm/xe/gsc: Wedge the device if the GSCCS reset fails
Due to the special handling of the GSCCS in HW, we can't escalate to GT
reset when we receive the reset failure interrupt; the specs indicate
that we should trigger an FLR instead, but we do not have support for
that at the moment, so the HW will stay permanently in a broken state.
We should therefore mark the device as wedged, the same as if the GT
reset had failed.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828221457.2752868-1-daniele.ceraolospurio@intel.com
2024-08-29 14:18:52 -07:00
Daniele Ceraolo Spurio
5ee2d63ca1 drm/xe/gsc: Add debugfs to print GSC info
This is useful for debug, in case something goes wrong with the GSC. The
info includes the version information and the current value of the HECI1
status registers.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828215158.2743994-5-daniele.ceraolospurio@intel.com
2024-08-29 10:32:20 -07:00
Daniele Ceraolo Spurio
f7c2ea682d drm/xe/gsc: Track the platform in the compatibility version
The GSC compatibility version number is reset for each new platform. To
indicate this, the version includes a number that identifies the
platform (102 = MTL, 104 = LNL); this matches what happens for the
release version, where the major number also identifies a platform.

To make it clearer in our logs that the compatibility version is
specific to the platform, it is useful to include this platform number.
However, given that our binary names already include the platform, it is
not necessary to add this extra number there.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828215158.2743994-4-daniele.ceraolospurio@intel.com
2024-08-29 10:32:19 -07:00
Daniele Ceraolo Spurio
7293859c51 drm/xe/gsc: Fix FW status if the firmware is already loaded
We set the FW status to "TRANSFERRED" after the load completes and to
"RUNNING"once we're done with proxy init, so do the same if we're trying
to re-load the FW and it is already loaded.

Note that there is no difference in driver behavior between the 2
states, but it's useful to be accurate when we dump the status for
debug.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828215158.2743994-3-daniele.ceraolospurio@intel.com
2024-08-29 10:32:18 -07:00
Daniele Ceraolo Spurio
2160f6f6e3 drm/xe/gsc: Do not attempt to load the GSC multiple times
The GSC HW is only reset by driver FLR or D3cold entry. We don't support
the former at runtime, while the latter is only supported on DGFX, for
which we don't support GSC. Therefore, if GSC failed to load previously
there is no need to try again because the HW is stuck in the error state.

An assert has been added so that if we ever add DGFX support we'll know
we need to handle the D3 case.

v2: use "< 0" instead of "!= 0" in the FW state error check (Julia).

Fixes: dd0e89e5ed ("drm/xe/gsc: GSC FW load")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828215158.2743994-2-daniele.ceraolospurio@intel.com
2024-08-29 10:32:17 -07:00
Jani Nikula
87d8ecf015 drm/xe: replace #include <drm/xe_drm.h> with <uapi/drm/xe_drm.h>
include/drm/xe_drm.h does not exist. Prefer the explicit uapi include.

Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240827091539.4136838-1-jani.nikula@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-08-28 15:17:54 -04:00
Karthik Poosa
a7f657097e drm/xe/hwmon: Fix WRITE_I1 param from u32 to u16
WRITE_I1 sub-command of the POWER_SETUP pcode command accepts a u16
parameter instead of u32. This change prevents potential illegal
sub-command errors.

v2: Mask uval instead of changing the prototype. (Badal)

v3: Rephrase commit message. (Badal)

Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Fixes: 92d44a422d ("drm/xe/hwmon: Expose card reactive critical power")
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240827155301.183383-1-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-08-28 14:59:26 -04:00
Ilia Levi
aeb4ae66cb drm/xe: move the kernel lrc from hwe to execlist port
The kernel lrc is used solely by the execlist infra.
Move it to the execlist port struct and initialize it only when
execlists are used.

v2: Rebase, improve error handling readability (Jonathan)

Signed-off-by: Ilia Levi <ilia.levi@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826100655.1719060-1-ilia.levi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-08-28 14:50:13 -04:00
Balasubramani Vivekanandan
3adcf970dc drm/xe/bmg: Drop force_probe requirement
Battlemage platform is sufficiently tested and found stable. CI is also
pretty stable. Remove the force_probe requirement to enable the platform
support by default.

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828082152.3194814-1-balasubramani.vivekanandan@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-08-28 10:47:03 -07:00
Himal Prasad Ghimiray
c72084163c drm/xe: Fix NPD in ggtt_node_remove()
Make sure that ggtt_node_remove() is invoked only if both node and ggtt
are not null. Move the null checks to the caller function
xe_ggtt_node_remove().

v2: Move null check below declarations (Tejas)

Fixes: 919bb54e98 ("drm/xe: Fix missing runtime outer protection for ggtt_remove_node")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828092229.3606503-1-himal.prasad.ghimiray@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-08-28 13:15:40 -04:00
Thomas Hellström
379cad69bd drm/xe: Use separate rpm lockdep map for non-d3cold-capable devices
For non-d3cold-capable devices we'd like to be able to wake up the
device from reclaim. In particular, for Lunar Lake we'd like to be
able to blit CCS metadata to system at shrink time; at least from
kswapd where it's reasonable OK to wait for rpm resume and a
preceding rpm suspend.

Therefore use a separate lockdep map for such devices and prime it
reclaim-tainted.

v2:
- Rename lockmap acquire- and release functions. (Rodrigo Vivi).
- Reinstate the old xe_pm_runtime_lockdep_prime() function and
  rename it to xe_rpm_might_enter_cb(). (Matthew Auld).
- Introduce a separate xe_pm_runtime_lockdep_prime function
  called from module init for known required locking orders.
v3:
- Actually hook up the prime function at module init.
v4:
- Rebase.
v5:
- Don't use reclaim-safe RPM with sriov.

Cc: "Vivi, Rodrigo" <rodrigo.vivi@intel.com>
Cc: "Auld, Matthew" <matthew.auld@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826143450.92511-1-thomas.hellstrom@linux.intel.com
2024-08-28 16:29:15 +02:00