[Why]
drm_dp_mst_topology_queue_probe() is used under the assumption that
the mst manager is already initialized. If the system is first
connected with SST and then switched to an mst branch during suspend,
probing the topology fails on resume because the wrong API is called
while the mst manager is yet to be initialized.
[How]
At dm_resume(), once an mst branch is detected as connected, check if
the mst is initialized already. If not, call
dm_helpers_dp_mst_start_top_mgr() instead to initialize mst.
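
The resume-time decision described above can be sketched as follows.
This is a minimal standalone model, not the amdgpu_dm code: struct
demo_mst_mgr and all demo_* functions are hypothetical stand-ins for
the real manager state, drm_dp_mst_topology_queue_probe() and
dm_helpers_dp_mst_start_top_mgr().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the MST manager state; not a real DRM type. */
struct demo_mst_mgr {
	bool initialized;   /* has the topology manager been started? */
	int  probes_queued; /* times the "queue probe" path was taken */
	int  starts;        /* times the "start manager" path was taken */
};

/* Models drm_dp_mst_topology_queue_probe(): assumes a started manager. */
static void demo_queue_probe(struct demo_mst_mgr *mgr)
{
	mgr->probes_queued++;
}

/* Models dm_helpers_dp_mst_start_top_mgr(): initializes the manager. */
static void demo_start_top_mgr(struct demo_mst_mgr *mgr)
{
	mgr->initialized = true;
	mgr->starts++;
}

/* Resume handling for a link that was detected as an mst branch. */
static void demo_resume_mst_link(struct demo_mst_mgr *mgr)
{
	if (mgr->initialized)
		demo_queue_probe(mgr);   /* manager is up: re-probe topology */
	else
		demo_start_top_mgr(mgr); /* SST -> mst switch while suspended */
}
```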
V2: Adjust the commit msg a bit
Fixes: bc068194f5 ("drm/amd/display: Don't write DP_MSTM_CTRL after LT")
Cc: Fangzhi Zuo <jerry.zuo@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Wayne Lin <Wayne.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
PMFW will manage RAS eeprom data by itself. Add a new interface to read
eeprom data via PMFW; part of the records can be read by setting the
index.
v2: use IPID parse interface.
pa is not used, so set it to a fixed value.
v3: optimize the null pointer check for IPID parse interface.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
So we can call it in other blocks.
v2: add a new IPID parse interface for umc and we can
implement it for each ASIC.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
commit c760bcda83 ("drm/amd: Check whether secure display TA loaded
successfully") attempted to fix extra messages, but failed to port the
cleanup that was in commit 5c6d52ff4b ("drm/amd: Don't try to enable
secure display TA multiple times") to prevent multiple tries.
Add that to the failure handling path even on a quick failure.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4679
Fixes: c760bcda83 ("drm/amd: Check whether secure display TA loaded successfully")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When amdgpu_device_health_check fails, amdgpu_ras_pre_reset
will not be called and therefore amdgpu_ras_post_reset
cannot be called either.
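
One way to keep the two hooks paired is to remember whether the
pre-reset hook ran. The sketch below is only a standalone model under
that assumption; struct demo_reset_ctx and the demo_* functions are
hypothetical and do not mirror the actual amdgpu recovery code.

```c
#include <stdbool.h>

/* Hypothetical reset bookkeeping; not a real amdgpu structure. */
struct demo_reset_ctx {
	bool ras_pre_reset_done; /* set only when the pre-reset hook ran */
	int  pre_calls;
	int  post_calls;
};

static void demo_ras_pre_reset(struct demo_reset_ctx *ctx)
{
	ctx->ras_pre_reset_done = true;
	ctx->pre_calls++;
}

static void demo_ras_post_reset(struct demo_reset_ctx *ctx)
{
	ctx->ras_pre_reset_done = false;
	ctx->post_calls++;
}

/* Returns 0 on success, -1 when the health check fails. */
static int demo_gpu_recover(struct demo_reset_ctx *ctx, bool health_ok)
{
	int ret = 0;

	if (!health_ok) {
		ret = -1;
		goto out; /* pre-reset never ran on this path */
	}

	demo_ras_pre_reset(ctx);
	/* ... the actual reset would happen here ... */
out:
	/* Only undo what was actually done. */
	if (ctx->ras_pre_reset_done)
		demo_ras_post_reset(ctx);
	return ret;
}
```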
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In a PF passthrough environment, running coralgemm after hibernate and
resume causes a gpu page fault.
Mode1 reset happens during hibernate, but the partition mode is not
restored on resume, so registers mmCP_HYP_XCP_CTL and mmCP_PSP_XCP_CTL
are not right after resume. When CP accesses the MQD BO, the wrong
stride size is used. This causes an out-of-bounds access on the MQD BO,
resulting in a page fault.
The fix is to ensure gfx_v9_4_3_switch_compute_partition() is called
when resuming from hibernation.
KFD resume is called separately during a reset recovery or resume from
suspend sequence. Hence it's not required to be called as part of
partition switch.
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If we don't end up initializing the fences, free them when
we free the job. We can't set the hw_fence to NULL after
emitting it because we need it in the cleanup path for the
submit direct case.
v2: take a reference to the fences if we emit them
v3: handle non-job fence in error paths
Fixes: db36632ea5 ("drm/amdgpu: clean up and unify hw fence handling")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> (v1)
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During gpu reset, all GPU-related resources are
inaccessible. To avoid affecting ras functionality,
suspend ras module before gpu reset and resume
it after gpu reset is complete.
V2:
Rename functions to avoid misunderstanding.
V3:
Move flush_delayed_work to amdgpu_ras_process_pause,
Move schedule_delayed_work to amdgpu_ras_process_unpause.
V4:
Rename functions.
V5:
Move the function to amdgpu_ras.c.
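
The pause/unpause idea (flush queued work before reset, reschedule it
afterwards) can be modelled as below. This is a simplified sketch, not
the driver code: struct demo_ras and the demo_* helpers are
hypothetical, and the delayed work is reduced to a pending counter.

```c
#include <stdbool.h>

/* Hypothetical model of the ras event processing state. */
struct demo_ras {
	bool paused;  /* true while the gpu is being reset */
	int  pending; /* events queued but not yet handled */
	int  handled; /* events that have been handled */
};

/* Models the delayed work item: handle everything that is queued. */
static void demo_ras_do_work(struct demo_ras *ras)
{
	ras->handled += ras->pending;
	ras->pending = 0;
}

/* Queue an event; it is only handled right away while not paused. */
static void demo_ras_queue_event(struct demo_ras *ras)
{
	ras->pending++;
	if (!ras->paused)
		demo_ras_do_work(ras);
}

/* Before reset: finish queued work (the flush), then stop processing. */
static void demo_ras_process_pause(struct demo_ras *ras)
{
	demo_ras_do_work(ras);
	ras->paused = true;
}

/* After reset: allow processing again and run what piled up meanwhile. */
static void demo_ras_process_unpause(struct demo_ras *ras)
{
	ras->paused = false;
	demo_ras_do_work(ras);
}
```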
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add wrapper functions for the pmfw eeprom interface so that these
interfaces can be called easily and safely.
Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add a function to check if pmfw is supported, and skip the eeprom
check and recovery when pmfw eeprom is supported.
Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This fixes a regression reported on gfx8, which requires the exhaustive
search path for signaled events.
The high CPU usage of the KFD interrupt wq is gone now that HIP/ROCr
has added an option to reduce HW event interrupts, so it is safe to
revert this optimization patch.
This reverts commit de844846f7.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If the process is killed, the vm entity is stopped. Submitting a pt
update job then triggers the error message "*ERROR* Trying to push to
a killed entity", and the job will not execute.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use a schema similar to gpu metrics v1.9 for partition metrics also. It
will have field type encoded followed by the field value(s). The
attribute ids used will be shared with gpu metrics. The structure
definition is only to distinguish between gpu metrics and partition
metrics though both gpu metrics v1.9 and partition metrics v1.1 follow
the same definition.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The BO's reservation object must be locked before using
amdgpu_vm_bo_update_shared; otherwise dma_resv_assert_held will
complain in amdgpu_vm_update_shared.
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
../ras/rascore/ras_cper.c: In function ‘cper_generate_fatal_record.isra’:
../ras/rascore/ras_cper.c:75:36: error: ‘%llX’ directive output may be truncated writing between 1 and 14 bytes into a region of size between 0 and 7 [-Werror=format-truncation=]
75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id,
| ^~~~
../ras/rascore/ras_cper.c:75:32: note: directive argument in the range [0, 72057594037927935]
75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id,
| ^~~~~~~~~
../ras/rascore/ras_cper.c:75:9: note: ‘snprintf’ output between 4 and 27 bytes into a destination of size 9
75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
76 | RAS_LOG_SEQNO_TO_BATCH_IDX(trace->seqno));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../ras/rascore/ras_cper.c: In function ‘cper_generate_runtime_record.isra’:
../ras/rascore/ras_cper.c:75:36: error: ‘%llX’ directive output may be truncated writing between 1 and 14 bytes into a region of size between 0 and 7 [-Werror=format-truncation=]
75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id,
| ^~~~
../ras/rascore/ras_cper.c:75:32: note: directive argument in the range [0, 72057594037927935]
75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id,
| ^~~~~~~~~
../ras/rascore/ras_cper.c:75:9: note: ‘snprintf’ output between 4 and 27 bytes into a destination of size 9
75 | snprintf(record_id, 9, "%d:%llX", dev_info.socket_id,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
76 | RAS_LOG_SEQNO_TO_BATCH_IDX(trace->seqno));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
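
The sizes in the diagnostic can be reproduced with a small standalone
sketch. It assumes a 32-bit int and takes the upper bound of the batch
index (72057594037927935, i.e. 2^56 - 1) from the compiler note. The
demo_* helpers are hypothetical and only illustrate how snprintf()
reports the required length; they are not the code in ras_cper.c.

```c
#include <limits.h>
#include <stdio.h>

/* Largest batch index per the compiler note: 2^56 - 1 (14 hex digits). */
#define DEMO_MAX_BATCH_IDX 0xFFFFFFFFFFFFFFULL

/* Bytes needed, including the NUL, to format "%d:%llX" for these values. */
static int demo_record_id_size(int socket_id, unsigned long long batch_idx)
{
	/* With a NULL buffer and size 0, snprintf only reports the length. */
	return snprintf(NULL, 0, "%d:%llX", socket_id, batch_idx) + 1;
}

/* Returns 1 if the id fit in buf, 0 if snprintf had to truncate it. */
static int demo_format_record_id(char *buf, size_t len, int socket_id,
				 unsigned long long batch_idx)
{
	int n = snprintf(buf, len, "%d:%llX", socket_id, batch_idx);

	return n >= 0 && (size_t)n < len;
}
```

Under these assumptions the shortest id needs 4 bytes and the longest
27, which matches the "between 4 and 27 bytes" note, so a 9 byte
destination only fits ids of up to 8 characters.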
Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This version brings along the following updates:
- HDCP2 FW locality check refactors
- Fix black screen issue with HDMI output
- Increase IB mem size
- Revert max buffered cursor size to 64
- Extend inbox0 lock to run Replay / PSR
- Refactor VActive implementation
- Add Pstate viewport reduction
- Persist stream refcount through restore
Acked-by: Wayne Lin <Wayne.Lin@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Signed-off-by: Ray Wu <ray.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why & How]
This fixes the black screen issue on certain APUs with HDMI,
accompanied by the following messages:
amdgpu 0000:c4:00.0: amdgpu: [drm] Failed to setup vendor info frame on connector DP-1: -22
amdgpu 0000:c4:00.0: [drm] Cannot find any crtc or sizes
[drm] Cannot find any crtc or sizes
Fixes: 489f0f600c ("drm/amd/display: Fix DVI-D/HDMI adapters")
Suggested-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Ray Wu <ray.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why & How]
The buffered cursor cap is expressed assuming a square cursor, and usage
of the cursor buffer is limited by the request size. For cursors wider
than 32 pixels, the request size is fixed at 256 bytes, so the maximum
width must be floored to a multiple of 256 bytes. At 4 bytes per pixel
this means that even with 24kB, DCN4 can only hold a 64x64 cursor in
the buffer, as even 65 pixels would require 512 bytes per line instead
of 256.
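
The arithmetic can be checked with a short standalone sketch. It
assumes 4 bytes per pixel (64 pixels then fill exactly one 256 byte
request) and, for simplicity, applies the 256 byte request size to all
widths, although the text only states it for widths above 32 pixels.
The demo_* names are hypothetical and not taken from the DC code.

```c
#include <stdint.h>

#define DEMO_REQ_BYTES       256u /* fixed request size, widths > 32 px */
#define DEMO_BYTES_PER_PIXEL 4u   /* assumed cursor pixel size */

/* Bytes fetched per cursor line: rounded up to whole requests. */
static uint32_t demo_line_bytes(uint32_t width_px)
{
	uint32_t raw = width_px * DEMO_BYTES_PER_PIXEL;

	return (raw + DEMO_REQ_BYTES - 1) / DEMO_REQ_BYTES * DEMO_REQ_BYTES;
}

/* Buffer bytes needed for a square cursor of size_px x size_px. */
static uint32_t demo_square_cursor_bytes(uint32_t size_px)
{
	return size_px * demo_line_bytes(size_px);
}

/* Largest square cursor that fits into buf_bytes of cursor buffer. */
static uint32_t demo_max_square_cursor(uint32_t buf_bytes)
{
	uint32_t size = 0;

	while (demo_square_cursor_bytes(size + 1) <= buf_bytes)
		size++;
	return size;
}
```

A 64x64 cursor needs 64 * 256 = 16384 bytes, while 65x65 needs
65 * 512 = 33280 bytes, which no longer fits into 24kB (24576 bytes).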
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Dillon Varone <Dillon.Varone@amd.com>
Signed-off-by: Ray Wu <ray.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why & How]
Overwriting the refcount on stream restore can lead to double-free errors
or memory leaks if an unbalanced number of retains and releases occurs
between a backup and restore.
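
A minimal sketch of keeping the refcount across a restore, assuming the
backup is a plain copy of the stream. struct demo_stream and the demo_*
functions are hypothetical and much simpler than the real stream state.

```c
/* Hypothetical stream with a refcount and some state worth backing up. */
struct demo_stream {
	int refcount; /* live count, owned by retain/release only */
	int timing;   /* stand-in for the state captured by a backup */
};

static void demo_retain(struct demo_stream *s)
{
	s->refcount++;
}

static void demo_release(struct demo_stream *s)
{
	s->refcount--;
}

/* Restore the backed-up state, but keep the refcount as it is now. */
static void demo_restore(struct demo_stream *s,
			 const struct demo_stream *backup)
{
	int live_refcount = s->refcount;

	*s = *backup;                /* copies the stale refcount too ... */
	s->refcount = live_refcount; /* ... so put the live one back */
}
```

If a retain happens between backup and restore, copying the backup
verbatim would drop that reference and a later release would free the
stream too early; an extra release in between would leak it instead.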
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: Joshua Aberback <joshua.aberback@amd.com>
Signed-off-by: Ray Wu <ray.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why & How]
Refactor VActive accounting in PMO and break down the fill time
requirement by P-State type, as it can result in drastically different
bandwidth requirements depending on the blackout length.
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: Austin Zheng <Austin.Zheng@amd.com>
Signed-off-by: Ray Wu <ray.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>