Commit Graph

1369218 Commits

Author SHA1 Message Date
Jesse.Zhang
655d6403ad drm/amd/vcn: Add late_init callback for VCN v4.0.3 reset handling
This change reorganizes VCN reset capability detection by:

1. Moving reset mask configuration from sw_init to new late_init phase
2. Adding vcn_v4_0_3_late_init() to properly check for per-queue reset support
3. Only setting soft full reset mask as fallback when per-queue reset isn't supported
4. Removing TODO comment now that queue reset support is implemented

V2: Removed unrelated changes. Keep amdgpu_get_soft_full_reset_mask in place
    and remove TODO comment. (Alex)
v3: set the flags at one place (all in late_init) (Lijo)

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:05:06 -04:00
Kent Russell
0ed704d058 drm/amdkfd: Handle lack of READ permissions in SVM mapping
HMM assumes that pages have READ permissions by default. Inside
svm_range_validate_and_map, we add READ permissions then add WRITE
permissions if the VMA isn't read-only. This will conflict with regions
that only have PROT_WRITE or have PROT_NONE. When that happens,
svm_range_restore_work will continue to retry, silently, giving the
impression of a hang if pr_debug isn't enabled to show the retries..

If pages don't have READ permissions, simply unmap them and continue. If
they weren't mapped in the first place, this would be a no-op. Since x86
doesn't support write-only, and PROT_NONE doesn't allow reads or writes
anyways, this will allow the svm range validation to continue without
getting stuck in a loop forever on mappings we can't use with HMM.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:04:41 -04:00
Jesse.Zhang
9d20f37a10 drm/amd/pm: Add VCN reset support for SMU v13.0.6
This commit implements VCN reset capability for SMU v13.0.6 with the following changes:

1. Added new PPSMC message ID (0x5B) for VCN reset in SMU firmware interface
2. Extended SMU capabilities to include VCN_RESET support
3. Implemented VCN reset support check:
   - Added smu_v13_0_6_reset_vcn_is_supported() function
4. Updated SMU v13.0.6 PPT functions to include VCN reset operations

v2: clean up debug info (Alex)
v3: remove unsupported message and split smu v13.0.6 changes to a separate patch (Lijo)
v4: simply the function (smu_v13_0_6_reset_vcn_is_supported) (Lijo)

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:04:33 -04:00
Jesse.Zhang
37b9257be7 drm/amd/pm: Add VCN reset support check capability
This change introduces infrastructure to check whether VCN reset
is supported by the SMU firmware. Key changes include:

1. Added new functions to query VCN reset support:
   - amdgpu_dpm_reset_vcn_is_supported()
   - smu_reset_vcn_is_supported()
   - pptable_funcs.reset_vcn_is_supported callback

2. Implemented proper locking in the DPM layer with mutex protection

3. Maintained consistency with existing SDMA reset support checks

The new capability allows callers to check for VCN reset support
before attempting the operation, preventing unnecessary attempts
on unsupported platforms.

v2: clean up debug info(Alex)

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:04:25 -04:00
Heng Zhou
859958a7fa drm/amdgpu: fix nullptr err of vm_handle_moved
If a amdgpu_bo_va is fpriv->prt_va, the bo of this one is always NULL.
So, such kind of amdgpu_bo_va should be updated separately before
amdgpu_vm_handle_moved.

Signed-off-by: Heng Zhou <Heng.Zhou@amd.com>
Reviewed-by: Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:04:07 -04:00
Eric Huang
3a75edf93a drm/amdkfd: set uuid for each partition in topology
Currently each kfd compute partition/node is sharing
the same uuid of AID, which doen't meet the CUDA spec
for visible device, so corresponding XCD id for each
partition in smu has been assigned to xcp, and exposed
to kfd topology.

v2: add NULL check (Lijo)

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:04:02 -04:00
Qianfeng Rong
5c8d5e2619 drm/amd/display: Use boolean context for pointer null checks
Replace "out == 0" with "!out" for pointer comparison to improve code
readability and conform to coding style.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:03:57 -04:00
Liao Yuanhong
90b810dd85 drm/amd/display: Remove redundant semicolons
Remove unnecessary semicolons.

Fixes: dda4fb85e4 ("drm/amd/display: DML changes for DCN32/321")
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:03:26 -04:00
Xichao Zhao
3e03525ce1 drm/radeon: replace min/max nesting with clamp()
The clamp() macro explicitly expresses the intent of constraining
a value within bounds.Therefore, replacing min(max(a, b), c) and
max(min(a,b),c) with clamp(val, lo, hi) can improve code readability.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:03:21 -04:00
Liu01 Tong
f101c13a87 drm/amdgpu: fix task hang from failed job submission during process kill
During process kill, drm_sched_entity_flush() will kill the vm
entities. The following job submissions of this process will fail, and
the resources of these jobs have not been released, nor have the fences
been signalled, causing tasks to hang and timeout.

Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to
stopped entity.

v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in
function amdgpu_cs_vm_handling().

Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com>
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:22:54 -04:00
Jack Xiao
b08425fa77 drm/amdgpu: fix incorrect vm flags to map bo
It should use vm flags instead of pte flags
to specify bo vm attributes.

Fixes: 7946340fa3 ("drm/amdgpu: Move csa related code to separate file")
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:19:57 -04:00
YiPeng Chai
d38eaf27de drm/amdgpu: fix vram reservation issue
The vram block allocation flag must be cleared
before making vram reservation, otherwise reserving
addresses within the currently freed memory range
will always fail.

Fixes: c9cad937c0 ("drm/amdgpu: add drm buddy support to amdgpu")
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:15:03 -04:00
Geoffrey McRae
57af162bfc drm/amdkfd: return -ENOTTY for unsupported IOCTLs
Some kfd ioctls may not be available depending on the kernel version the
user is running, as such we need to report -ENOTTY so userland can
determine the cause of the ioctl failure.

Signed-off-by: Geoffrey McRae <geoffrey.mcrae@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:14:54 -04:00
Frank Min
065e23170a drm/amdgpu: Add PSP fw version check for fw reserve GFX command
The fw reserved GFX command is only supported starting from PSP fw
version 0x3a0e14 and 0x3b0e0d. Older versions do not support this command.

Add a version guard to ensure the command is only used when the running
PSP fw meets the minimum version requirement.

This ensures backward compatibility and safe operation across fw
revisions.

Fixes: a3b7f9c306 ("drm/amdgpu: reclaim psp fw reservation memory region")
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:13:03 -04:00
Lijo Lazar
d543489aa1 drm/amdgpu: Add description for partition commands
Add string description for partition commands.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:12:57 -04:00
Brahmajit Das
260dcf5b06 drm/radeon/r600_cs: clean up of dead code in r600_cs
GCC 16 enables -Werror=unused-but-set-variable= which results in build
error with the following message.

drivers/gpu/drm/radeon/r600_cs.c: In function ‘r600_texture_size’:
drivers/gpu/drm/radeon/r600_cs.c:1411:29: error: variable ‘level’ set but not used [-Werror=unused-but-set-variable=]
 1411 |         unsigned offset, i, level;
      |                             ^~~~~
cc1: all warnings being treated as errors
make[6]: *** [scripts/Makefile.build:287: drivers/gpu/drm/radeon/r600_cs.o] Error 1

level although is set, but in never used in the function
r600_texture_size. Thus resulting in dead code and this error getting
triggered.

Fixes: 60b212f8dd ("drm/radeon: overhaul texture checking. (v3)")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Brahmajit Das <listout@listout.xyz>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:44 -04:00
Cryolitia PukNgae
388b68aef7 drm/amdgpu: fix incorrect comment format
Comments should not have a leading plus sign.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Cryolitia PukNgae <cryolitia@uniontech.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:41 -04:00
Taimur Hassan
8d6593c192 drm/amd/display: Promote DC to 3.2.345
This version brings along following update:
-Fix close and open lid may cause eDP remaining blank
-Fix frequently disabling/enabling OTG may cause incorrect
 configuration of OTG

Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Signed-off-by: Tom Chung <chiahsuan.chung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:38 -04:00
Taimur Hassan
7552bee9dd drm/amd/display: [FW Promotion] Release 0.1.22.0
Add a new command for Panel Replay.

Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Signed-off-by: Tom Chung <chiahsuan.chung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:35 -04:00
Danny Wang
ad335b5fc9 drm/amd/display: Reset apply_eamless_boot_optimization when dpms_off
[WHY&HOW]
The user closed the lid while the system was powering on and opened it
again before the “apply_seamless_boot_optimization” was set to false,
resulting in the eDP remaining blank.
Reset the “apply_seamless_boot_optimization” to false when dpms off.

Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Danny Wang <Danny.Wang@amd.com>
Signed-off-by: Tom Chung <chiahsuan.chung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:26 -04:00
TungYu Lu
e7496c15d8 drm/amd/display: Wait until OTG enable state is cleared
[Why]
Customer reported an issue that OS starts and stops device multiple times
during driver installation. Frequently disabling and enabling OTG may
prevent OTG from being safely disabled and cause incorrect configuration
upon the next enablement.

[How]
Add a wait until OTG_CURRENT_MASTER_EN_STATE is cleared as a short term
solution.

Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: TungYu Lu <tungyu.lu@amd.com>
Signed-off-by: Tom Chung <chiahsuan.chung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:21 -04:00
Vitaly Prosyak
c31f486bc8 drm/amdgpu: add to custom amdgpu_drm_release drm_dev_enter/exit
User queues are disabled before GEM objects are released
(protecting against user app crashes).
No races with PCI hot-unplug (because drm_dev_enter prevents cleanup
if iewdevice is being removed).

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:03 -04:00
Lijo Lazar
1dd2fa0e00 drm/amdgpu: Save and restore switch state
During a DPC error kernel waits for the link to be active before
notifying downstream devices. On certain platforms with Broadcom switch
in synthetiic mode, switch responds with values even though the link is
not fully ready. The config space restoration done by pcie port driver
for SWUS/DS of dGPU is thus not effective as the switch is still doing
internal enumeration.

As a workaround, save state of SWUS/DS device in driver. Add additional
check to see if link is active and restore the values during DPC error
callbacks.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:12:53 -04:00
Sathishkumar S
111821e4b5 drm/amdgpu/vcn: Hold pg_lock before vcn power off
Acquire vcn_pg_lock before changes to vcn power state
and release it after power off in idle work handler.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:54 -04:00
Sathishkumar S
0e7581eda8 drm/amdgpu/jpeg: Hold pg_lock before jpeg poweroff
Acquire jpeg_pg_lock before changes to jpeg power state
and release it after power off from idle work handler.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:50 -04:00
Lijo Lazar
0b4d79dafa drm/amdgpu: Assign unique id to compute partition
Assign unique id to compute partition. This is the unique id of the
first XCD instance belonging to the partition.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:47 -04:00
Lijo Lazar
6fa8216854 drm/amd/pm: Add unique ids for SMUv13.0.12 SOCs
Fetch and store the unique ids for AIDs/XCDs in SMUv13.0.12 SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:42 -04:00
Alex Deucher
aae94897b6 drm/amdgpu: add missing vram lost check for LEGACY RESET
Legacy resets reset the memory controllers so VRAM contents
may be unreliable after reset.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:14 -04:00
Alex Deucher
62eedd150f drm/amdgpu/discovery: fix fw based ip discovery
We only need the fw based discovery table for sysfs.  No
need to parse it.  Additionally parsing some of the board
specific tables may result in incorrect data on some boards.
just load the binary and don't parse it on those boards.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441
Fixes: 80a0e82829 ("drm/amdgpu/discovery: optionally use fw based ip discovery")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:32:38 -04:00
Srinivasan Shanmugam
9d69391794 drm/amd/display: Add NULL check for stream before dereference in 'dm_vupdate_high_irq'
Add a NULL check for acrtc->dm_irq_params.stream before
accessing its members.

Fixes below:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:623
dm_vupdate_high_irq() warn: variable dereferenced before check
'acrtc->dm_irq_params.stream' (see line 615)

614	if (vrr_active) {
615		bool replay_en = acrtc->dm_irq_params.stream->link->replay_settings.replay_feature_enabled;
				 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
616		bool psr_en = acrtc->dm_irq_params.stream->link->psr_settings.psr_feature_enabled;
			      ^^^^^^^^^^^^^^^^^^^^^^^^^^^ New dereferences

617		bool fs_active_var_en = acrtc->dm_irq_params.freesync_config.state
618			== VRR_STATE_ACTIVE_VARIABLE;
619
620		amdgpu_dm_crtc_handle_vblank(acrtc);
621
622		/* BTR processing for pre-DCE12 ASICs */
623		if (acrtc->dm_irq_params.stream &&
		    ^^^^^^^^^^^^^^^^^^^^^^^^^^^ But the existing code assumed it could be NULL. Someone is wrong.

624		    adev->family < AMDGPU_FAMILY_AI) {
625			spin_lock_irqsave(&adev_to_drm(adev)->event_lock, flags);

Fixes: 6d31602a9f ("drm/amd/display: more liberal vmin/vmax update for freesync")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Alex Hung <alex.hung@amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Cc: Roman Li <roman.li@amd.com>
Cc: ChiaHsuan Chung <chiahsuan.chung@amd.com>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Ray Wu <ray.wu@amd.com>
Cc: Daniel Wheeler <daniel.wheeler@amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:30:54 -04:00
Lijo Lazar
5bf93e1d6e drm/amd/pm: Add caching to SMUv13.0.12 temp metric
Add table caching logic to temperature metrics tables in SMUv13.0.12

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:30:51 -04:00
Lijo Lazar
476060020f drm/amd/pm: Add cache logic for temperature metric
Add caching logic for baseboard and gpuboard temperature metrics tables.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:30:47 -04:00
Lijo Lazar
615471b860 drm/amd/pm: Remove cache logic from SMUv13.0.12
Remove caching logic of temperature metrics from SMUv13.0.12. The
caching logic needs to be moved to a higher level.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:30:43 -04:00
Lijo Lazar
10c12aae4e drm/amd/pm: Add unique ids for SMUv13.0.6 SOCs
Fetch and store the unique ids for AIDs/XCDs in SMUv13.0.6 SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:22:11 -04:00
Lijo Lazar
589ea8a1fd drm/amdgpu: Add helpers to set/get unique ids
Add a struct to store unique id information for each type. Add helper
to fetch the unique id.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:22:08 -04:00
Lijo Lazar
892bac995b drm/amdgpu: Prevent hardware access in dpc state
Don't allow hardware access while in dpc state.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:22:03 -04:00
Lijo Lazar
1a0e57eb96 drm/amdgpu/vcn: Fix double-free of vcn dump buffer
The buffer is already freed as part of amdgpu_vcn_reg_dump_fini(). The
issue is introduced by below patch series.

Fixes: de55cbff5c ("drm/amdgpu/vcn: Add regdump helper functions")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:21:16 -04:00
Lijo Lazar
c5c62160a5 drm/amdgpu: Log reset source during recovery
To get more context, add reset source to identify the source of gpu
recovery - job timeout, RAS, HWS hang etc.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:21:05 -04:00
Xiang Liu
b3505c2c48 drm/amdgpu: Generate BP threshold exceed CPER once threshold exceeded
The bad pages threshold exceed CPER should be generated once threshold
exceeded, no matter the bad_page_threshold setted or not.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:21:01 -04:00
Asad Kamal
d9f6a07043 drm/amd/pm: Enable temperature metrics caps
Enable temperature metrics caps for smu_v13_0_12

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:20:57 -04:00
Asad Kamal
25e82f2e2c drm/amd/pm: Add temperature metrics sysfs entry
Add temperature metrics sysfs entry to expose gpuboard/baseboard
temperature metrics

v2: Removed unused function, rename functions(Lijo)

v3: Remove unnecessary initialization

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:20:55 -04:00
Asad Kamal
33074558ec drm/amd/pm: Fetch and fill temperature metrics
Fetch system metrics table to fill gpuboard/baseboard temperature
metrics data for smu_v13_0_12

v2: Remove unnecessary checks, used separate metrics time for
temperature metrics table(Lijo)

v3: Use cached values for back to back system metrics query(Lijo)

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:20:51 -04:00
Asad Kamal
793ff2bafe drm/amd/pm: Update pmfw header for smu_v13_0_12
Update pmfw header for smu_v13_0_12 with system temperature metrics
table

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:20:47 -04:00
Asad Kamal
775c7e8e4d drm/amd/pm: Add smu interface for temp metrics
Add smu interface to get baseboard/gpuboard temperature metrics

v2: Rename is_support to is_supported(Lijo)

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:20:44 -04:00
Asad Kamal
83953ec1fe drm/amd/pm: Add dpm interface for temp metrics
Add dpm interface to get gpuboard/baseboard temperature metrics

v2: Add temperature metrics support check(Lijo)

v3: Return error code in case of operation not supported(Lijo)

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:20:34 -04:00
Aurabindo Pillai
e9c840d450 drm/amd/display: Fix vupdate_offload_work doc
Fix the following warning in struct documentation:

drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:168: warning: expecting prototype for struct dm_vupdate_work. Prototype was for struct vupdate_offload_work instead

Fixes: c210b757b4 ("drm/amd/display: fix dmub access race condition")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:19:44 -04:00
James Zhu
bd6093e2f1 drm/amdkfd: return migration pages from copy function
dst MIGRATE_PFN_VALID bit and src MIGRATE_PFN_MIGRATE bit
should always be set when migration success. cpage includes
src MIGRATE_PFN_MIGRATE bit set and MIGRATE_PFN_VALID bit
unset pages for both ram and vram when memory is only allocated
without being populated before migration, those ram pages should
be counted as migrate pages and those vram pages should not be
counted as migrate pages. Here migration pages refer to how many
vram pages involved.

-v2 use dst to check MIGRATE_PFN_VALID bit (suggested-by Philip)
-v3 add warning when vram pages is less than migration pages
    return migration pages directly from copy function
-v4 correct comments and copy function return mpage (suggested-by Felix)

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:19:12 -04:00
James Zhu
5c15a05b52 drm/amdkfd: remove unused code
upages is assigned under cpages = 0, so it isn't really used in this function.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Philip.Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:19:01 -04:00
Lijo Lazar
6ec7120dec drm/amd/pm: Add priority messages for SMU v13.0.6
Certain messages will processed with high priority by PMFW even if it
hasn't responded to a previous message. Send the priority message
regardless of the success/fail status of the previous message. Add
support on SMUv13.0.6 and SMUv13.0.12

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:18:51 -04:00
Lijo Lazar
91c4fd4164 drm/amdgpu: Set dpc status appropriately
Set the dpc status based on hardware state. Also, clear the status before
reinitialization after a successful reset.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:18:35 -04:00