Commit Graph

1354356 Commits

Author SHA1 Message Date
Dave Airlie
4f577bed5c Merge tag 'drm-xe-next-fixes-2025-06-05' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
Driver Changes:
- A couple of vm init fixes (Matt Auld)
- Hwmon fixes (Karthik)
- Drop reduntant conversion to bool (Raag)
- Fix CONFIG_INTEL_VSEC dependency (Arnd)
- Rework eviction rejection of bound external bos (Thomas)
- Stop re-submitting signalled jobs (Matt Auld)
- A couple of pxp fixes (Daniele)
- Add back a fix that got lost in a merge (Matt Auld)
- Create LRC bo without VM (Niranjana)
- Fix for the above fix (Maciej)

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/aEHq44uIAZwfK-mG@fedora
2025-06-06 11:37:08 +10:00
Dave Airlie
27bba88644 Merge tag 'drm-misc-next-fixes-2025-06-05' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-fixes for v6.16-rc1:
- Fixes for nt37801 panel
- Fix null deref in HDMI audio helper.
- Fixes for analogix_dp.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/14c2eff8-701d-4699-b187-08862715e1ac@linux.intel.com
2025-06-06 11:32:13 +10:00
Dave Airlie
df9d494182 Merge tag 'drm-intel-next-fixes-2025-06-05' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
- Fix PSR register definitions for ALPM
- Fix u32 overflow in SNPS PHY HDMI PLL setup
- Fix GuC pending message underflow when submit fails
- Fix GuC wakeref underflow race during reset

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://lore.kernel.org/r/aEFW1wGnt1kTVNGF@jlahtine-mobl
2025-06-06 09:08:29 +10:00
Maciej Patelczyk
7c7c5cb5b5 drm/xe: remove unmatched xe_vm_unlock() from __xe_exec_queue_init()
There is unmatched xe_vm_unlock() in the __xe_exec_queue_init().
Leftover from commit fbeaad071a ("drm/xe: Create LRC BO without VM")

Fixes: 2b0a0ce0c2 ("drm/xe: Create LRC BO without VM")
Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Link: https://lore.kernel.org/r/20250530135627.2821612-1-maciej.patelczyk@intel.com
(cherry picked from commit 28b996ce73)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:55:46 +02:00
Niranjana Vishwanathapura
2b0a0ce0c2 drm/xe: Create LRC BO without VM
Specifying VM during lrc->bo creation requires VM's reference
to be held for the lifetime of lrc->bo as it will use VM's dma
reservation object. Using VM's dma reservation object for
lrc->bo doesn't provide any advantage. Hence do not pass VM
while creating lrc->bo.

v2: Use xe_bo_unpin_map_no_vm (Matthew Brost)

Fixes: 264eecdba2 ("drm/xe: Decouple xe_exec_queue and xe_lrc")
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250529052031.2429120-2-niranjana.vishwanathapura@intel.com
(cherry picked from commit fbeaad071a)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:55:28 +02:00
Matthew Auld
2e824747cf drm/xe/guc_submit: add back fix
Daniele noticed that the fix in commit 2d2be279f1 ("drm/xe: fix UAF
around queue destruction") looks to have been unintentionally removed as
part of handling a conflict in some past merge commit. Add it back.

Fixes: ac44ff7cec ("Merge tag 'drm-xe-fixes-2024-10-10' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes")
Reported-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.12+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250603174213.1543579-2-matthew.auld@intel.com
(cherry picked from commit 9d9fca62dc)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:08:10 +02:00
Daniele Ceraolo Spurio
69a58ef4fa drm/xe/pxp: Clarify PXP queue creation behavior if PXP is not ready
The expected flow of operations when using PXP is to query the PXP
status and wait for it to transition to "ready" before attempting to
create an exec_queue. This flow is followed by the Mesa driver, but
there is no guarantee that an incorrectly coded (or malicious) app
will not attempt to create the queue first without querying the status.
Therefore, we need to clarify what the expected behavior of the queue
creation ioctl is in this scenario.

Currently, the ioctl always fails with an -EBUSY code no matter the
error, but for consistency it is better to distinguish between "failed
to init" (-EIO) and "not ready" (-EBUSY), the same way the query ioctl
does. Note that, while this is a change in the return code of an ioctl,
the behavior of the ioctl in this particular corner case was not clearly
spec'd, so no one should have been relying on it (and we know that Mesa,
which is the only known userspace for this, didn't).

v2: Minor rework of the doc (Rodrigo)

Fixes: 72d479601d ("drm/xe/pxp/uapi: Add userspace and LRC support for PXP-using queues")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250522225401.3953243-7-daniele.ceraolospurio@intel.com
(cherry picked from commit 21784ca960)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:07:33 +02:00
Daniele Ceraolo Spurio
6bf4d56492 drm/xe/pxp: Use the correct define in the set_property_funcs array
The define of the extension type was accidentally used instead of the
one of the property itself. They're both zero, so no functional issue,
but we should use the correct define for code correctness.

Fixes: 41a97c4a12 ("drm/xe/pxp/uapi: Add API to mark a BO as using PXP")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250522225401.3953243-6-daniele.ceraolospurio@intel.com
(cherry picked from commit 1d891ee820)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:07:25 +02:00
Matthew Auld
0ee54d5cac drm/xe/sched: stop re-submitting signalled jobs
Customer is reporting a really subtle issue where we get random DMAR
faults, hangs and other nasties for kernel migration jobs when stressing
stuff like s2idle/s3/s4. The explosions seems to happen somewhere
after resuming the system with splats looking something like:

PM: suspend exit
rfkill: input handler disabled
xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0
xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=24496, lrc_seqno=24496, guc_id=0, flags=0x13 in no process [-1]
xe 0000:00:02.0: [drm] GT0: Kernel-submitted job timed out

The likely cause appears to be a race between suspend cancelling the
worker that processes the free_job()'s, such that we still have pending
jobs to be freed after the cancel. Following from this, on resume the
pending_list will now contain at least one already complete job, but it
looks like we call drm_sched_resubmit_jobs(), which will then call
run_job() on everything still on the pending_list. But if the job was
already complete, then all the resources tied to the job, like the bb
itself, any memory that is being accessed, the iommu mappings etc. might
be long gone since those are usually tied to the fence signalling.

This scenario can be seen in ftrace when running a slightly modified
xe_pm IGT (kernel was only modified to inject artificial latency into
free_job to make the race easier to hit):

xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
xe_exec_queue_stop:   dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
xe_exec_queue_stop:   dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=1, guc_state=0x0, flags=0x4
xe_exec_queue_stop:   dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=0, guc_state=0x0, flags=0x3
xe_exec_queue_stop:   dev=0000:00:02.0, 1:0x1, gt=1, width=1, guc_id=1, guc_state=0x0, flags=0x3
xe_exec_queue_stop:   dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=2, guc_state=0x0, flags=0x3
xe_exec_queue_resubmit: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
.....
xe_exec_queue_memory_cat_error: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x3, flags=0x13

So the job_run() is clearly triggered twice for the same job, even
though the first must have already signalled to completion during
suspend. We can also see a CAT error after the re-submit.

To prevent this only resubmit jobs on the pending_list that have not yet
signalled.

v2:
  - Make sure to re-arm the fence callbacks with sched_start().
v3 (Matt B):
  - Stop using drm_sched_resubmit_jobs(), which appears to be deprecated
    and just open-code a simple loop such that we skip calling run_job()
    on anything already signalled.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4856
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: William Tseng <william.tseng@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://lore.kernel.org/r/20250528113328.289392-2-matthew.auld@intel.com
(cherry picked from commit 38fafa9f39)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:07:15 +02:00
Thomas Hellström
5cc3325584 drm/xe: Rework eviction rejection of bound external bos
For preempt_fence mode VM's we're rejecting eviction of
shared bos during VM_BIND. However, since we do this in the
move() callback, we're getting an eviction failure warning from
TTM. The TTM callback intended for these things is
eviction_valuable().

However, the latter doesn't pass in the struct ttm_operation_ctx
needed to determine whether the caller needs this.

Instead, attach the needed information to the vm under the
vm->resv, until we've been able to update TTM to provide the
needed information. And add sufficient lockdep checks to prevent
misuse and races.

v2:
- Fix a copy-paste error in xe_vm_clear_validating()
v3:
- Fix kerneldoc errors.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Fixes: 0af944f0e3 ("drm/xe: Reject BO eviction if BO is bound to current VM")
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250528164105.234718-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit 9d5558649f)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:07:07 +02:00
Arnd Bergmann
2182f358fb drm/xe/vsec: fix CONFIG_INTEL_VSEC dependency
The XE driver can be built with or without VSEC support, but fails to link as
built-in if vsec is in a loadable module:

x86_64-linux-ld: vmlinux.o: in function `xe_vsec_init':
(.text+0x1e83e16): undefined reference to `intel_vsec_register'

The normal fix for this is to add a 'depends on INTEL_VSEC || !INTEL_VSEC',
forcing XE to be a loadable module as well, but that causes a circular
dependency:

        symbol DRM_XE depends on INTEL_VSEC
        symbol INTEL_VSEC depends on X86_PLATFORM_DEVICES
        symbol X86_PLATFORM_DEVICES is selected by DRM_XE

The problem here is selecting a symbol from another subsystem, so change
that as well and rephrase the 'select' into the corresponding dependency.
Since X86_PLATFORM_DEVICES is 'default y', there is no change to
defconfig builds here.

Fixes: 0c45e76fcc ("drm/xe/vsec: Support BMG devices")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250529172355.2395634-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit e4931f8be3)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:06:59 +02:00
Raag Jadav
9411082792 drm/xe: drop redundant conversion to bool
The result of integer comparison already evaluates to bool. No need for
explicit conversion.

No functional impact.

Fixes: 0e414bf7ad ("drm/xe: Expose PCIe link downgrade attributes")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505292205.MoljmkjQ-lkp@intel.com/
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250529160937.490147-1-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 61761a6b57)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:06:01 +02:00
Karthik Poosa
b885ae2e9d drm/xe/hwmon: Move card reactive critical power under channel card
Move power2/curr2_crit to channel 1 i.e power1/curr1_crit as this
represents the entire card critical power/current.

v2: Update the date of curr1_crit also in hwmon documentation.

Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Fixes: 345dadc4f6 ("drm/xe/hwmon: Add infra to support card power and energy attributes")
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://lore.kernel.org/r/20250529163458.2354509-3-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 25e963a09e)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:05:54 +02:00
Karthik Poosa
25a2aa779f drm/xe/hwmon: Add support to manage power limits though mailbox
Add support to manage power limits using pcode mailbox commands
for supported platforms.

v2:
 - Address review comments. (Badal)
 - Use mailbox commands instead of registers to manage power limits
   for BMG.
 - Clamp the maximum power limit to GPU firmware default value.

v3:
 - Clamp power limit in write also for platforms with mailbox support.

v4:
 - Remove unnecessary debug prints. (Badal)

v5:
 - Update description of variable pl1_on_boot to fix kernel-doc error.

v6:
 - Improve commit message, refer to BIOS as GPU firmware.
 - Change macro READ_PL_FROM_BIOS to READ_PL_FROM_FW.
 - Rectify drm_warn to drm_info.

Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Fixes: e90f7a58e6 ("drm/xe/hwmon: Add HWMON support for BMG")
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://lore.kernel.org/r/20250529163458.2354509-2-karthik.poosa@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 7596d839f6)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:05:44 +02:00
Matthew Auld
8cf8cde41a drm/xe/vm: move xe_svm_init() earlier
In xe_vm_close_and_put() we need to be able to call xe_svm_fini(),
however during vm creation we can call this on the error path, before
having actually initialised the svm state, leading to various splats
followed by a fatal NPD.

Fixes: 6fd979c2f3 ("drm/xe: Add SVM init / close / fini to faulting VMs")
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4967
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250514152424.149591-4-matthew.auld@intel.com
(cherry picked from commit 4f296d77cf)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:05:19 +02:00
Matthew Auld
a63e99b4d6 drm/xe/vm: move rebind_work init earlier
In xe_vm_close_and_put() we need to be able to call
flush_work(rebind_work), however during vm creation we can call this on
the error path, before having actually set up the worker, leading to a
splat from flush_work().

It looks like we can simply move the worker init step earlier to fix
this.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250514152424.149591-3-matthew.auld@intel.com
(cherry picked from commit 96af397aa1)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-06-05 18:05:10 +02:00
Dave Airlie
61a4b7d7d5 Merge tag 'drm-msm-fixes-2025-06-04' of https://gitlab.freedesktop.org/drm/msm into drm-next
Just some mailmap/MAINTAINER email address updates

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://lore.kernel.org/r/CAF6AEGuW1kpUijxzLCc4TKph72-PfXvHTttNu214_1GB-R8qwg@mail.gmail.com
2025-06-05 12:39:09 +10:00
Rob Clark
ee11d953fd MAINTAINERS: .mailmap: update Rob Clark's email address
Remap historical email addresses to @oss.qualcomm.com.

Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Acked-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Patchwork: https://patchwork.freedesktop.org/patch/656974/
2025-06-04 13:26:41 -07:00
Akhil P Oommen
cb46078168 mailmap: Update entry for Akhil P Oommen
A new policy within qualcomm requires me to use a new email address
for all future contributions to Linux kernel. Update the mailmap to
map my old email addresses to the new one, ie akhilpo@oss.qualcomm.com

Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
Link: https://lore.kernel.org/lkml/20250603121508.296678-1-quic_akhilpo@quicinc.com
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
2025-06-04 13:24:56 -07:00
Abhinav Kumar
d6984d0c0a MAINTAINERS: update my email address
My current email address will stop working soon. Use
linux.dev email instead.

Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Acked-by: Rob Clark <robdclark@gmail.com>
Patchwork: https://patchwork.freedesktop.org/patch/655555/
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
2025-06-04 13:21:05 -07:00
Abhinav Kumar
11bb662bfa MAINTAINERS: drop myself as maintainer
I will no longer regularly work on this platform. Hence will
step down from maintainer duties.

Also, add Jessica as a reviewer to the MSM DRM subsystem to help
out with the reviews.

Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Acked-by: Rob Clark <robdclark@gmail.com>
Patchwork: https://patchwork.freedesktop.org/patch/655558/
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
2025-06-04 13:19:27 -07:00
Dibin Moolakadan Subrahmanian
791d76005d drm/i915/display: Fix u32 overflow in SNPS PHY HDMI PLL setup
When configuring the HDMI PLL, calculations use DIV_ROUND_UP_ULL and
DIV_ROUND_DOWN_ULL macros, which internally rely on do_div. However, do_div
expects a 32-bit (u32) divisor, and at higher data rates, the divisor can
exceed this limit. This leads to incorrect division results and
ultimately misconfigured PLL values.
This fix replaces do_div calls with  div64_base64 calls where diviser
can exceed u32 limit.

Fixes: 5947642004 ("drm/i915/display: Add support for SNPS PHY HDMI PLL algorithm for DG2")
Cc: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Cc: Suraj Kandpal <suraj.kandpal@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Dibin Moolakadan Subrahmanian <dibin.moolakadan.subrahmanian@intel.com>
Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Link: https://lore.kernel.org/r/20250528064557.4172149-1-dibin.moolakadan.subrahmanian@intel.com
(cherry picked from commit ce924116e4)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-06-04 17:30:53 +03:00
Jesus Narvaez
0323a5127e drm/i915/guc: Handle race condition where wakeref count drops below 0
There is a rare race condition when preparing for a reset where
guc_lrc_desc_unpin() could be in the process of deregistering a context
while a different thread is scrubbing outstanding contexts and it alters
the context state and does a wakeref put. Then, if there is a failure
with deregister_context(), a second wakeref put could occur. As a result
the wakeref count could drop below 0 and fail an INTEL_WAKEREF_BUG_ON()
check.

Therefore if there is a failure with deregister_context(), undo the
context state changes and do a wakeref put only if the context was set
to be destroyed earlier.

v2: Expand comment to better explain change. (Daniele)
v3: Removed addition to the original comment. (Daniele)

Fixes: 2f2cc53b5f ("drm/i915/guc: Close deregister-context race against CT-loss")
Signed-off-by: Jesus Narvaez <jesus.narvaez@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Anshuman Gupta <anshuman.gupta@intel.com>
Cc: Mousumi Jana <mousumi.jana@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250528230551.1855177-1-jesus.narvaez@intel.com
(cherry picked from commit f36a75aba1)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-06-02 12:19:14 +03:00
Jouni Högander
57d63c6cd0 drm/i915/psr: Fix using wrong mask in REG_FIELD_PREP
Wrong mask is used in PORT_ALPM_LFPS_CTL_FIRST_LFPS_HALF_CYCLE_DURATION and
PORT_ALPM_LFPS_CTL_LAST_LFPS_HALF_CYCLE_DURATION.

Fixes: 295099580f ("drm/i915/psr: Add missing ALPM AUX-Less register definitions")
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Link: https://lore.kernel.org/r/20250526120512.1702815-12-jouni.hogander@intel.com
(cherry picked from commit 8097128a40)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-06-02 12:19:11 +03:00
Jesus Narvaez
c557fd1050 drm/i915/guc: Check if expecting reply before decrementing outstanding_submission_g2h
When sending a H2G message where a reply is expected in
guc_submission_send_busy_loop(), outstanding_submission_g2h is
incremented before the send. However, if there is an error sending the
message, outstanding_submission_g2h is decremented without checking if a
reply is expected.

Therefore, check if reply is expected when there is a failure before
decrementing outstanding_submission_g2h.

Fixes: 2f2cc53b5f ("drm/i915/guc: Close deregister-context race against CT-loss")
Signed-off-by: Jesus Narvaez <jesus.narvaez@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Anshuman Gupta <anshuman.gupta@intel.com>
Cc: Mousumi Jana <mousumi.jana@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250514225224.4142684-1-jesus.narvaez@intel.com
(cherry picked from commit a6a26786f2)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-06-02 12:19:08 +03:00
Dave Airlie
1c1df79ccf Merge tag 'amd-drm-fixes-6.16-2025-05-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-fixes-6.16-2025-05-29:

amdgpu:
- UserQ fixes
- SMU 13.x fixes
- VCN fixes
- JPEG fixes
- Misc cleanups
- runtime pm fix
- DCN 4.0.1 fixes
- Misc display fixes
- ISP fix
- VRAM manager fix
- RAS fixes

amdkfd:
- SVM fix
- Misc cleanups
- Ref leak fix
- WPTR BO fix

radeon:
- Misc cleanups

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250529205215.6790-1-alexander.deucher@amd.com
2025-06-02 11:56:06 +10:00
Lang Yu
30837a49bd drm/amdkfd: Map wptr BO to GART unconditionally
For simulation C models that don't run CP FW where adev->mes.sched_version
is not populated correctly. This causes NULL dereference in
amdgpu_amdkfd_free_gtt_mem(dev->adev, (void **)&pqn->q->wptr_bo_gart)
and warning on unpinned BO in amdgpu_bo_gpu_offset(q->properties.wptr_bo).

Compared with adding version check here and there,
always map wptr BO to GART simplifies things.

v2: Add NULL check in amdgpu_amdkfd_free_gtt_mem.(Philip)

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:58:44 -04:00
Alex Deucher
684530526f drm/amdgpu/mes: remove some unused functions
Nothing uses them so remove them.  Leftover from
MES bring up.

Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:58:39 -04:00
Alex Deucher
40f970ba7a drm/amdgpu/mes: add missing locking in helper functions
We need to take the MES lock.

Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-05-29 10:58:10 -04:00
Mario Limonciello
82a277d529 drm/amd: Export DMCUB version to sysfs
For supported ASICs DMCU version is exported, but ASICs that support
DMCUB there is no information exported to sysfs.

Add an attribute for DMCUB.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Link: https://lore.kernel.org/r/20250527155942.476354-1-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:42 -04:00
Asad Kamal
324194b528 drm/amd/pm: Enable static metrics table support
Enable static metrics support to fetch board voltage and pldm version
for smu_v13_0_14

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:39 -04:00
Asad Kamal
c7d43a1485 drm/amd/pm: Enable static metrics table support
Enable static metrics support to fetch board voltage and pldm version
for other smu_v13_0_6 program

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:36 -04:00
Christophe JAILLET
347efe5b39 drm/amd/display: Constify struct timing_generator_funcs
'struct timing_generator_funcs' are not modified in these drivers.

Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:31 -04:00
Wentao Liang
c3e9826a22 drm/amd/display: Add null pointer check for get_first_active_display()
The function mod_hdcp_hdcp1_enable_encryption() calls the function
get_first_active_display(), but does not check its return value.
The return value is a null pointer if the display list is empty.
This will lead to a null pointer dereference in
mod_hdcp_hdcp2_enable_encryption().

Add a null pointer check for get_first_active_display() and return
MOD_HDCP_STATUS_DISPLAY_NOT_FOUND if the function return null.

Fixes: 2deade5ede ("drm/amd/display: Remove hdcp display state with mst fix")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # v5.8
2025-05-29 10:57:13 -04:00
ganglxie
fce0afca35 drm/amdgpu: Get mca address for old eeprom records
after getting mca address for old eeprom records with 'address==0', it can be
correctly parsed under none-nps1, or it will be dropped.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:09 -04:00
ganglxie
31e837d242 drm/amdgpu: handle old RAS eeprom data in non-nps1 mode
Get MCA address from PA in nps1, then convert MCA address to PA in specific nps
mode.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:05 -04:00
Yifan Zhang
90237b16ec amd/amdkfd: fix a kfd_process ref leak
This patch is to fix a kfd_prcess ref leak.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:02 -04:00
Arunpravin Paneer Selvam
5ae9de5867 drm/amdgpu: Add userq fence support to SDMAv6.0
Add userq fence support to SDMAv6.0

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:58 -04:00
Sunday Clement
1091fba163 drm/amdkfd: Identical code for different branches
This patch removes the if/else statement in the
cik_event_interrupt_wq function because it is redundant
with both branches resulting in identical outcomes,
this improves code readibility.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:54 -04:00
Asad Kamal
978592136c drm/amd/pm: Optimize get gpu metrics data function
Optimize get gpu metrics data function for smu_v13_0_12 to
allocate metrics structure only once

v2: Free and alloc moved to same function(Kevin)

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:50 -04:00
John Olender
4d2f6b4e4c drm/amdgpu: amdgpu_vram_mgr_new(): Clamp lpfn to total vram
The drm_mm allocator tolerated being passed end > mm->size, but the
drm_buddy allocator does not.

Restore the pre-buddy-allocator behavior of allowing such placements.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3448
Signed-off-by: John Olender <john.olender@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-05-29 10:56:23 -04:00
David (Ming Qiang) Wu
bf394d2854 drm/amdgpu/vcn5.0.1: read back register after written
The addition of register read-back in VCN v5.0.1 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:19 -04:00
David (Ming Qiang) Wu
a8bce9b7a2 drm/amdgpu/vcn5: read back register after written
The addition of register read-back in VCN v5.0.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:17 -04:00
David (Ming Qiang) Wu
4d4275a038 drm/amdgpu/vcn4.0.5: read back register after written
The addition of register read-back in VCN v4.0.5 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:14 -04:00
David (Ming Qiang) Wu
5b4c6413c8 drm/amdgpu/vcn4.0.3: read back register after written
The addition of register read-back in VCN v4.0.3 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:12 -04:00
David (Ming Qiang) Wu
a3810a5e37 drm/amdgpu/vcn4: read back register after written
The addition of register read-back in VCN v4.0.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:10 -04:00
David (Ming Qiang) Wu
b7a4842a91 drm/amdgpu/vcn3: read back register after written
The addition of register read-back in VCN v3.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:08 -04:00
David (Ming Qiang) Wu
d9e688b914 drm/amdgpu/vcn2.5: read back register after written
The addition of register read-back in VCN v2.5 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:05 -04:00
David (Ming Qiang) Wu
8c5ed7f5ab drm/amdgpu/vcn2: read back register after written
The addition of register read-back in VCN v2.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:03 -04:00
Fangzhi Zuo
988b1d2164 Revert "drm/amd/display: pause the workload setting in dm"
This reverts commit 50f29ead1f.

Reason for revert: cause corruption on Dell U3224KB DP2 display.

Signed-off-by: Fangzhi Zuo <Jerry.Zuo@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:55:47 -04:00