Commit Graph

1339135 Commits

Author SHA1 Message Date
Kenneth Feng
c770ef1967 drm/amd/amdgpu: disable ASPM in some situations
disable ASPM with some ASICs on some specific platforms.
required from PCIe controller owner.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11 16:54:48 -04:00
Prike Liang
0ec7535f5b drm/amdgpu: remove the duplicated mes queue active state setting
The MES queue deactivation and active status are already set in
mes_userq_unmap|map(), so the caller needn't set the queue_active
bit again.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11 16:54:39 -04:00
Ruili Ji
f0ec5926da amd/amdgpu: Implement VCN queue reset for vcn 4.0.3
Add function for vcn queue reset to make driver to
do fine-grained reset instead of the whole gpu reset.

Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11 16:54:27 -04:00
Masha Grinman
8ef4e99674 drm/amdgpu: Move read of snoop register from guest to host
Guest is reading/writing to snoop register which is a security violation
We moved the code to the host driver
And also added a validation on the guest side to check if it's guest

Signed-off-by: Masha Grinman <Masha.Grinman@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11 16:54:02 -04:00
Mario Limonciello
2aabd44aa8 drm/amd: Forbid suspending into non-default suspend states
On systems that default to 'deep' some userspace software likes
to try to suspend in 'deep' first.  If there is a failure for any
reason (such as -ENOMEM) the failure is ignored and then it will
try to use 's2idle' as a fallback. This fails, but more importantly
it leads to graphical problems.

Forbid this behavior and only allow suspending in the last state
supported by the system.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4093
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250408180957.4027643-1-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11 16:53:08 -04:00
Christian König
8b2ae7d492 drm/amdgpu: use a dummy owner for sysfs triggered cleaner shaders v4
Otherwise triggering sysfs multiple times without other submissions in
between only runs the shader once.

v2: add some comment
v3: re-add missing cast
v4: squash in semicolon fix

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11 16:52:33 -04:00
Alex Deucher
72801504fd drm/amdgpu/sdma7: add support for disable_kq
When the parameter is set, disable user submissions
to kernel queues.

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:23 -04:00
Alex Deucher
fcf5eb979a drm/amdgpu/sdma6: add support for disable_kq
When the parameter is set, disable user submissions
to kernel queues.

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:23 -04:00
Alex Deucher
1d65006fc1 drm/amdgpu/sdma: add flag for tracking disable_kq
For SDMA, we still need kernel queues for paging so
they need to be initialized, but we no not want to
accept submissions from userspace when disable_kq
is set.

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:23 -04:00
Alex Deucher
0981e0ef18 drm/amdgpu/gfx12: add support for disable_kq
Plumb in support for disabling kernel queues.

v2: use ring counts per Felix' suggestion
v3: fix stream fault handler, enable EOP interrupts
v4: fix MEC interrupt offset (Sunil)
v5: clean up after removing extra sched.ready settings

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:23 -04:00
Alex Deucher
1e63ebc0d4 drm/amdgpu/gfx11: add support for disable_kq
Plumb in support for disabling kernel queues in
GFX11.  We have to bring up a GFX queue briefly in
order to initialize the clear state.  After that
we can disable it.

v2: use ring counts per Felix' suggestion
v3: fix stream fault handler, enable EOP interrupts
v4: fix MEC interrupt offset (Sunil)

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
1f61fc28b9 drm/amdgpu/mes: make more vmids available when disable_kq=1
If we don't have kernel queues, the vmids can be used by
the MES for user queues.

Acked-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
acdc43f270 drm/amdgpu/mes: update hqd masks when disable_kq is set
Make all resources available to user queues.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Suggested-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
f091fa777b drm/amdgpu/gfx: add generic handling for disable_kq
Add proper checks for disable_kq functionality in
gfx helper functions.  Add special logic for families
that require the clear state setup.

v2: use ring count as per Felix suggestion
v3: fix num_gfx_rings handling in amdgpu_gfx_graphics_queue_acquire()
v4: fix error code (Alex)

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
4310acd446 drm/amdgpu: add ring flag for no user submissions
This would be set by IPs which only accept submissions
from the kernel, not userspace, such as when kernel
queues are disabled. Don't expose the rings to userspace
and reject any submissions in the CS IOCTL.

v2: fix error code (Alex)

Reviewed-by: Sunil Khatri<sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
a96a787d6d drm/amdgpu: add parameter to disable kernel queues
On chips that support user queues, setting this option
will disable kernel queues to be used to validate
user queues without kernel queues.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
cf97de5b54 drm/amdgpu/userq: prevent runtime pm when userqs are active
Similar to KFD, prevent runtime pm while user queues are active.

Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
4ce60dbada drm/amdgpu: store userq_managers in a list in adev
So we can iterate across them when we need to manage
all user queues.

v2: add uq_mgr to adev list in amdgpu_userq_mgr_init

Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
100b6010d7 drm/amdgpu: bump version for user queue IP support query
Add the user queue IP support query to the drm_amdgpu_info_device
query.

Cc: marek.olsak@amd.com
Cc: prike.liang@amd.com
Cc: sunil.khatri@amd.com
Cc: yogesh.mohanmarimuthu@amd.com
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
1af6881263 drm/amdgpu: add UAPI to query if user queues are supported
Add an INFO query to check if user queues are supported.

v2: switch to a mask of IPs (Marek)
v3: move to drm_amdgpu_info_device (Marek)

Cc: marek.olsak@amd.com
Cc: prike.liang@amd.com
Cc: sunil.khatri@amd.com
Cc: yogesh.mohanmarimuthu@amd.com
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
a4a3373da2 drm/amdgpu/gfx12: split userq setup to a separate switch
Add a separate switch statement for the userq callback
assignment so that we can assign the callbacks for each
asic as the firmware becomes available.

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Alex Deucher
9983ed9693 drm/amdgpu/gfx11: clean up and consolidate sw_init
With the ME details fixed, we can now consolidate
this state.  Also split out the userq setup into a separate
switch statement so that we can set them per IP version
when the firmwares are ready.

Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:22 -04:00
Arvind Yadav
32bd8b3ea7 drm/amdgpu: Fix display freezing issue when resizing apps
The display is freezing because the amdgpu_userq_wait_ioctl()
is waiting for a non-user queue fence(specifically, the PT update fence).

RootCause:
The resume_work is initiated by both amdgpu_userq_suspend and
amdgpu_userqueue_ensure_ev_fence at same time. The amdgpu_userq_suspend
signals a dma-fence and subsequently triggers the resume_work, which is
intended to replace the existing fence by creating new dma-fence. However,
following this, the amdgpu_userqueue_ensure_ev_fence schedules another
resume_work that generates a new dma-fence, thereby replacing the one
created by amdgpu_userq_suspend. Consequently, the original fence will
never be signaled.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Shashank Sharma <shashank.sharma@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
b6f190e623 drm/amdgpu/mes: warn on unexpected pipe numbers
Warn if the number of pipes exceeds what the MES supports.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
9e2bbba1d5 drm/amdgpu/mes: centralize gfx_hqd mask management
Move it to amdgpu_mes to align with the compute and
sdma hqd masks. No functional change.

v2: rebase on new changes
v3: misc optimizations

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Sunil Khatri<sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
4220d2c7c4 drm/amdgpu: remove is_mes_queue flag
This was leftover from MES bring up when we had MES
user queues in the kernel.  It's no longer used so
remove it.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
cb17fff3a2 drm/amdgpu/mes: remove unused functions
Leftover from the MES self tests that were removed previously.

Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
158bfbc72c drm/amdgpu: validate user queue parameters
Make sure these are set properly to ensure compatibility if
we ever update the IOCTL interface.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Arvind Yadav
ad6c120f68 drm/amdgpu: fix the memleak caused by fence not released
Encountering a taint issue during the unloading of gpu_sched
due to the fence not being released/put. In this context,
amdgpu_vm_clear_freed is responsible for creating a job to
update the page table (PT). It allocates kmem_cache for
drm_sched_fence and returns the finished fence associated
with job->base.s_fence. In case of Usermode queue this finished
fence is added to the timeline sync object through
amdgpu_gem_update_bo_mapping, which is utilized by user
space to ensure the completion of the PT update.

[  508.900587] =============================================================================
[  508.900605] BUG drm_sched_fence (Tainted: G                 N): Objects remaining in drm_sched_fence on __kmem_cache_shutdown()
[  508.900617] -----------------------------------------------------------------------------

[  508.900627] Slab 0xffffe0cc04548780 objects=32 used=2 fp=0xffff8ea81521f000 flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff)
[  508.900645] CPU: 3 UID: 0 PID: 2337 Comm: rmmod Tainted: G                 N 6.12.0+ #1
[  508.900651] Tainted: [N]=TEST
[  508.900653] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F34 06/10/2021
[  508.900656] Call Trace:
[  508.900659]  <TASK>
[  508.900665]  dump_stack_lvl+0x70/0x90
[  508.900674]  dump_stack+0x14/0x20
[  508.900678]  slab_err+0xcb/0x110
[  508.900687]  ? srso_return_thunk+0x5/0x5f
[  508.900692]  ? try_to_grab_pending+0xd3/0x1d0
[  508.900697]  ? srso_return_thunk+0x5/0x5f
[  508.900701]  ? mutex_lock+0x17/0x50
[  508.900708]  __kmem_cache_shutdown+0x144/0x2d0
[  508.900713]  ? flush_rcu_work+0x50/0x60
[  508.900719]  kmem_cache_destroy+0x46/0x1f0
[  508.900728]  drm_sched_fence_slab_fini+0x19/0x970 [gpu_sched]
[  508.900736]  __do_sys_delete_module.constprop.0+0x184/0x320
[  508.900744]  ? srso_return_thunk+0x5/0x5f
[  508.900747]  ? debug_smp_processor_id+0x1b/0x30
[  508.900754]  __x64_sys_delete_module+0x16/0x20
[  508.900758]  x64_sys_call+0xdf/0x20d0
[  508.900763]  do_syscall_64+0x51/0x120
[  508.900769]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

v2: call dma_fence_put in amdgpu_gem_va_update_vm
v3: Addressed review comments from Christian.
    - calling amdgpu_gem_update_timeline_node before switch.
    - puting a dma_fence in case of error or !timeline_syncobj.
v4: Addressed review comments from Christian.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Shashank Sharma <shashank.sharma@amd.com>
Cc: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Le Ma <le.ma@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
ecdb0b32e5 drm/amdgpu/userq: move the header to amdgpu directory
To align with other headers.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
665de8c947 drm/amdgpu/userq: remove BROKEN from config
This can be enabled now.  We have the firmware checks
in place.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
5ca4095960 drm/amdgpu: add userq firmware version checks
Currently disabled until the firmwares are officially
released.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
f36e4876c8 drm/amdgpu/gfx11: fix config guard
s/CONFIG_DRM_AMD_USERQ_GFX/CONFIG_DRM_AMDGPU_NAVI3X_USERQ/

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:21 -04:00
Alex Deucher
c4f42c8d0b drm/amdgpu/Kconfig: fix wording of DRM_AMDGPU_NAVI3X_USERQ
The feature is not navi3x specific at this point.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Alex Deucher
df85baa767 drm/amdgpu: return an error in the userq IOCTL when DRM_AMDGPU_NAVI3X_USERQ=n
I'd swear this was already fixed, but I guess the patch never
landed.  Add it now.

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Alex Deucher
2a060b3ae9 drm/amdgpu/userq: handle runtime pm
Take a reference when we create a queue and drop it
when we destroy the queue.  We need to keep the device
active while user queues are active.

v2: squash in fix from Sunil
v3: squash in fix from Prike

Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Alex Deucher
29adc5c2dd drm/amdgpu/userq: fix hardcoded uq functions
Use the IP type to look up the userq functions rather
than hardcoding it.

Reviewed-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Arvind Yadav
f15d4e92f7 drm/amdgpu: Fix display freeze lockup error
A deadlock situation has arised between the userq
signal ioctl and the eviction fence. In this scenario,
the function amdgpu_userq_signal_ioctl() has acquired a reservation
lock on the read/write buffer object (BO) through drm_exec.
Subsequently, it calls amdgpu_userqueue_ensure_ev_fence(),
which is in a waiting for the userq resume work.
Meanwhile, the userq suspend worker has initiated the userq resume
work(amdgpu_userqueue_resume_worker). This userq resume work attempts
to validate the vm->done BO, leading to amdgpu_userqueue_validate_bos
also attempting to reservation lock the same write BO that is already
locked by amdgpu_userq_signal_ioctl.
As a result, the resume work becomes stalled, causing
amdgpu_userqueue_ensure_ev_fence to remain in a waiting state.

Call Trace:
[  242.836469] INFO: task gnome-shel:cs0:1288 blocked for more than 120 seconds.
[  242.836486]       Tainted: G           OE      6.12.0-rc2rebased-oct-24+ #4
[  242.836491] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.836494] task:gnome-shel:cs0  state:D stack:0     pid:1288  tgid:1282  ppid:1180   flags:0x00000002
[  242.836503] Call Trace:
[  242.836508]  <TASK>
[  242.836517]  __schedule+0x3e0/0xb10
[  242.836530]  ? srso_return_thunk+0x5/0x5f
[  242.836541]  schedule+0x31/0x120
[  242.836546]  schedule_timeout+0x150/0x160
[  242.836551]  ? srso_return_thunk+0x5/0x5f
[  242.836555]  ? sysvec_call_function+0x69/0xd0
[  242.836562]  ? srso_return_thunk+0x5/0x5f
[  242.836567]  ? preempt_count_add+0x7f/0xd0
[  242.836577]  __wait_for_common+0x91/0x180
[  242.836582]  ? __pfx_schedule_timeout+0x10/0x10
[  242.836590]  wait_for_completion+0x28/0x30
[  242.836595]  __flush_work+0x16c/0x290
[  242.836602]  ? __pfx_wq_barrier_func+0x10/0x10
[  242.836611]  flush_delayed_work+0x3a/0x60
[  242.836621]  amdgpu_userqueue_ensure_ev_fence+0x2d/0xb0 [amdgpu]
[  242.836966]  amdgpu_userq_signal_ioctl+0x959/0xec0 [amdgpu]
[  242.837171]  ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[  242.837365]  drm_ioctl_kernel+0xae/0x100 [drm]
[  242.837398]  drm_ioctl+0x2a1/0x500 [drm]
[  242.837420]  ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[  242.837622]  ? srso_return_thunk+0x5/0x5f
[  242.837627]  ? srso_return_thunk+0x5/0x5f
[  242.837630]  ? _raw_spin_unlock_irqrestore+0x2b/0x50
[  242.837635]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[  242.837811]  __x64_sys_ioctl+0x99/0xd0
[  242.837820]  x64_sys_call+0x1209/0x20d0
[  242.837825]  do_syscall_64+0x51/0x120
[  242.837830]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  242.837835] RIP: 0033:0x7f2f33f1a94f
[  242.837838] RSP: 002b:00007f2f24ffea30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  242.837842] RAX: ffffffffffffffda RBX: 00007f2f24ffebd0 RCX: 00007f2f33f1a94f
[  242.837845] RDX: 00007f2f24ffebd0 RSI: 00000000c0306457 RDI: 000000000000000d
[  242.837847] RBP: 00007f2f24ffeab0 R08: 0000000000000000 R09: 0000000000000000
[  242.837849] R10: 00007f2f24ffecd0 R11: 0000000000000246 R12: 00007f2f25000640
[  242.837851] R13: 00000000c0306457 R14: 000000000000000d R15: 00007fff3b39c1e0
[  242.837858]  </TASK>
[  242.837865] INFO: task Xwayland:cs0:1517 blocked for more than 120 seconds.
[  242.837869]       Tainted: G           OE      6.12.0-rc2rebased-oct-24+ #4
[  242.837872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.837874] task:Xwayland:cs0    state:D stack:0     pid:1517  tgid:1338  ppid:1282   flags:0x00004002
[  242.837878] Call Trace:
[  242.837880]  <TASK>
[  242.837883]  __schedule+0x3e0/0xb10
[  242.837890]  schedule+0x31/0x120
[  242.837894]  schedule_preempt_disabled+0x1c/0x30
[  242.837897]  __mutex_lock.constprop.0+0x386/0x6e0
[  242.837902]  ? srso_return_thunk+0x5/0x5f
[  242.837905]  ? __timer_delete_sync+0x81/0xe0
[  242.837911]  __mutex_lock_slowpath+0x13/0x20
[  242.837915]  mutex_lock+0x3b/0x50
[  242.837919]  amdgpu_userqueue_ensure_ev_fence+0x35/0xb0 [amdgpu]
[  242.838138]  amdgpu_userq_signal_ioctl+0x959/0xec0 [amdgpu]
[  242.838340]  ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[  242.838531]  drm_ioctl_kernel+0xae/0x100 [drm]
[  242.838559]  drm_ioctl+0x2a1/0x500 [drm]
[  242.838580]  ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[  242.838778]  ? srso_return_thunk+0x5/0x5f
[  242.838783]  ? srso_return_thunk+0x5/0x5f
[  242.838786]  ? _raw_spin_unlock_irqrestore+0x2b/0x50
[  242.838791]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[  242.838967]  __x64_sys_ioctl+0x99/0xd0
[  242.838972]  x64_sys_call+0x1209/0x20d0
[  242.838975]  do_syscall_64+0x51/0x120
[  242.838979]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  242.838982] RIP: 0033:0x7f9118b1a94f
[  242.838985] RSP: 002b:00007f910cdff760 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  242.838989] RAX: ffffffffffffffda RBX: 00007f910cdff910 RCX: 00007f9118b1a94f
[  242.838991] RDX: 00007f910cdff910 RSI: 00000000c0306457 RDI: 000000000000000c
[  242.838993] RBP: 00007f910cdff7e0 R08: 0000000000000000 R09: 0000000000000001
[  242.838995] R10: 00007f910cdff9d4 R11: 0000000000000246 R12: 00007f910ce00640
[  242.838997] R13: 00000000c0306457 R14: 000000000000000c R15: 00007fff9dd11d10
[  242.839004]  </TASK>

v2: Addressed review comemnts from Christian.
v3/v4: Addressed review comemnts from Christian.
   - Move drm_exec drm_exec loop after userq fence create.
   - cleanup the newly created userq fence in case of error.
v5 - Addressed review comemnts from Christian.
   - Create a new amdgpu_userq_fence_alloc() function for allocation.
   - Calling dma_fence_put for cleanup procedure.
   - make amdgpu_userq_fence_create() function static.
   - drm_exec_init is called after mutex_unlock.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Shashank Sharma <shashank.sharma@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Arunpravin Paneer Selvam
fc4a85c6b2 drm/amdgpu: Modify the seq64 VM cache policy
The seq64 VM cache policy should be set to UC (Uncached) to
match with userqueue fence address kernel mapped memory's
cache settings.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Arunpravin Paneer Selvam
239a310b49 drm/amdgpu: Fix out-of-bounds issue in user fence
Fix out-of-bounds issue in userq fence create when
accessing the userq xa structure. Added a lock to
protect the race condition.

v2:(Christian)
  - Allocate memory with GFP_ATOMIC.

v3:
  - Moved to 2 xa approach.

v4:(Christian)
  - Lock the xa_for_each blocks and memory allocation part
    as well to make sure that xa is not modified in between
    the 2 xa_for_each blocks.

BUG: KASAN: slab-out-of-bounds in amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[  +0.000006] Call Trace:
[  +0.000005]  <TASK>
[  +0.000005]  dump_stack_lvl+0x6c/0x90
[  +0.000011]  print_report+0xc4/0x5e0
[  +0.000009]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  ? kasan_complete_mode_report_info+0x26/0x1d0
[  +0.000007]  ? amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[  +0.000405]  kasan_report+0xdf/0x120
[  +0.000009]  ? amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[  +0.000405]  __asan_report_store8_noabort+0x17/0x20
[  +0.000007]  amdgpu_userq_fence_create+0x726/0x880 [amdgpu]
[  +0.000406]  ? __pfx_amdgpu_userq_fence_create+0x10/0x10 [amdgpu]
[  +0.000408]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  ? ttm_resource_move_to_lru_tail+0x235/0x4f0 [ttm]
[  +0.000013]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  amdgpu_userq_signal_ioctl+0xd29/0x1c70 [amdgpu]
[  +0.000412]  ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[  +0.000404]  ? try_to_wake_up+0x165/0x1840
[  +0.000010]  ? __pfx_futex_wake_mark+0x10/0x10
[  +0.000011]  drm_ioctl_kernel+0x178/0x2f0 [drm]
[  +0.000050]  ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[  +0.000404]  ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
[  +0.000043]  ? __kasan_check_read+0x11/0x20
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? __kasan_check_write+0x14/0x20
[  +0.000008]  drm_ioctl+0x513/0xd20 [drm]
[  +0.000040]  ? __pfx_amdgpu_userq_signal_ioctl+0x10/0x10 [amdgpu]
[  +0.000407]  ? __pfx_drm_ioctl+0x10/0x10 [drm]
[  +0.000044]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? _raw_spin_lock_irqsave+0x99/0x100
[  +0.000007]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[  +0.000006]  ? __rseq_handle_notify_resume+0x188/0xc30
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000006]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  +0.000010]  amdgpu_drm_ioctl+0xcd/0x1d0 [amdgpu]
[  +0.000388]  __x64_sys_ioctl+0x135/0x1b0
[  +0.000009]  x64_sys_call+0x1205/0x20d0
[  +0.000007]  do_syscall_64+0x4d/0x120
[  +0.000008]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000007] RIP: 0033:0x7f7c3d31a94f

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Saleemkhan Jamadar
49cd3353db drm/amdgpu: add db size and offset range for VCN and VPE
VCN and VPE have different offset range, update the doorbell
offset range repsectively.
Doorbell size for VCN and VPE is 32bit.

v1 : add gfx switch case and fix checkpatch warnings (Shashank)

Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Saleemkhan Jamadar
3e37fcb57b drm/amdgpu: map doorbell for the requested userq
Introduce db_info structure to the populate the doorbell
information that is required to be mapped.

Made changes to the doorbell mapping func more generic,
by taking parameters that vary based on IPs and/or usecase
into db_info structure.

v2 - Fix space alignment and checkpatch warnings(Shashank)

Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Christian König
8639d2f5ca drm/amdgpu: fix call to amdgpu_eviction_fence_detach
That needs to be done after grabbing the lock, not before.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Arvind Yadav
adba092973 drm/amdgpu: Fix Illegal opcode in command stream Error
When applications closes, it triggers the drm_file_free
function which subsequently releases all allocated buffer
objects. Concurrently, the resume_worker thread will attempt
to map the usermode queue. However, since the wptr buffer
object has already been deallocated, this will result in
an Illegal opcode error being raised in the command stream.

Now replacing drm_release() with a new function
amdgpu_drm_release(). This function will set the flag to
prevent the scheduling of any new queue resume/map, stop
all queues and then call drm_release().

V2:
  - Replace drm_release with amdgpu_drm_release(Christian).

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Arunpravin Paneer Selvam
02521454f0 drm/amdgpu: Apply sign extension to seq64
Apply sign extension to seq64 va address.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:20 -04:00
Christian König
91acb5d47b drm/amdgpu: Modify the MES process va end limit
Modify the MES process va end limit to max pfn.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:19 -04:00
Arunpravin Paneer Selvam
ed5fdc1fc2 drm/amdgpu: Fix the use-after-free issue in wait IOCTL
The xarray pointer which has the userqueue xarray structure
reference should be cleared when the userqueue gets
destroyed. Otherwise, we may access the freed xa memory and
see the below warnings.

warning 1:
BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x7a/0xe0
[  +0.000044] Call Trace:
[  +0.000017]  <TASK>
[  +0.000016]  dump_stack_lvl+0x6c/0x90
[  +0.000025]  print_report+0xc4/0x5e0
[  +0.000025]  ? srso_return_thunk+0x5/0x5f
[  +0.000024]  ? kasan_complete_mode_report_info+0x60/0x1d0
[  +0.000030]  ? _raw_spin_lock+0x7a/0xe0
[  +0.000023]  kasan_report+0xdf/0x120
[  +0.000023]  ? _raw_spin_lock+0x7a/0xe0
[  +0.000025]  kasan_check_range+0xf7/0x1b0
[  +0.000025]  __kasan_check_write+0x14/0x20
[  +0.000024]  _raw_spin_lock+0x7a/0xe0
[  +0.000023]  ? __pfx__raw_spin_lock+0x10/0x10
[  +0.000024]  ? amdgpu_userq_wait_ioctl+0xac0/0x1f30 [amdgpu]
[  +0.000442]  amdgpu_userq_wait_ioctl+0x18fc/0x1f30 [amdgpu]
[  +0.000428]  ? __pfx_amdgpu_userq_wait_ioctl+0x10/0x10 [amdgpu]
[  +0.000424]  ? __pfx_idr_alloc_u32+0x10/0x10
[  +0.000027]  ? srso_return_thunk+0x5/0x5f
[  +0.000024]  ? __kasan_check_write+0x14/0x20
[  +0.000025]  ? srso_return_thunk+0x5/0x5f
[  +0.000024]  ? idr_alloc+0x72/0xc0
[  +0.000023]  ? srso_return_thunk+0x5/0x5f
[  +0.000023]  ? fput+0x1c/0x2f0
[  +0.000025]  drm_ioctl_kernel+0x178/0x2f0 [drm]
[  +0.000065]  ? __pfx_amdgpu_userq_wait_ioctl+0x10/0x10 [amdgpu]
[  +0.000425]  ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
[  +0.000064]  ? srso_return_thunk+0x5/0x5f
[  +0.000023]  ? __kasan_check_write+0x14/0x20
[  +0.000025]  drm_ioctl+0x513/0xd20 [drm]
[  +0.000058]  ? __pfx_amdgpu_userq_wait_ioctl+0x10/0x10 [amdgpu]
[  +0.000428]  ? __pfx_drm_ioctl+0x10/0x10 [drm]
[  +0.000061]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[  +0.000027]  ? __count_memcg_events+0x11f/0x3a0
[  +0.000027]  ? srso_return_thunk+0x5/0x5f
[  +0.001040]  ? srso_return_thunk+0x5/0x5f
[  +0.000969]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[  +0.000966]  amdgpu_drm_ioctl+0xcd/0x1d0 [amdgpu]
[  +0.001352]  __x64_sys_ioctl+0x135/0x1b0
[  +0.000966]  x64_sys_call+0x1205/0x20d0
[  +0.000968]  do_syscall_64+0x4d/0x120
[  +0.000960]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000962] RIP: 0033:0x7f42af11a94f

warning 2:
WARNING: at lib/xarray.c:1849 __xa_alloc+0x13a/0x150
[  366.491409] RIP: 0010:__xa_alloc+0x13a/0x150
[  366.491434] Call Trace:
[  366.491437]  <TASK>
[  366.491440]  ? show_regs+0x6d/0x80
[  366.491445]  ? __warn+0x91/0x140
[  366.491450]  ? __xa_alloc+0x13a/0x150
[  366.491453]  ? report_bug+0x1c9/0x1e0
[  366.491459]  ? handle_bug+0x63/0xa0
[  366.491463]  ? exc_invalid_op+0x1d/0x80
[  366.491467]  ? asm_exc_invalid_op+0x1f/0x30
[  366.491476]  ? __xa_alloc+0x13a/0x150
[  366.491484]  amdgpu_userq_wait_ioctl+0xe0e/0xfe0 [amdgpu]
[  366.491743]  ? idr_alloc_u32+0x97/0xd0
[  366.491749]  ? __pfx_amdgpu_userq_wait_ioctl+0x10/0x10 [amdgpu]
[  366.491912]  drm_ioctl_kernel+0xae/0x100 [drm]
[  366.491942]  drm_ioctl+0x2a1/0x500 [drm]
[  366.491961]  ? __pfx_amdgpu_userq_wait_ioctl+0x10/0x10 [amdgpu]
[  366.492127]  ? srso_return_thunk+0x5/0x5f
[  366.492132]  ? srso_return_thunk+0x5/0x5f
[  366.492135]  ? _raw_spin_unlock_irqrestore+0x2b/0x50
[  366.492139]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[  366.492288]  __x64_sys_ioctl+0x99/0xd0
[  366.492295]  x64_sys_call+0x1209/0x20d0
[  366.492299]  do_syscall_64+0x51/0x120
[  366.492303]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  366.492418] RIP: 0033:0x7f86f3b1a94f

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:19 -04:00
Arunpravin Paneer Selvam
c9e20cb005 drm/amdgpu: Fix NULL ptr dereference issue for non userq fences
Add the correct fences count variable [num_fences] in the fences
array iteration to handle the userq / non-userq fences.

v2:(Christian)
  - All fences in the array either come from some reservation object
    or drm_syncobj. If any of those are NULL then there is a bug
    somewhere else.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:19 -04:00
Arunpravin Paneer Selvam
9ed335d939 drm/amdgpu: Add mqd for userq compute queue
Add mqd for userq compute queue for gfx11/gfx12

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:19 -04:00
Shashank Sharma
31f7efcdca drm/amdgpu: enable eviction fence
This patch enables attachment and detachment of eviction fences.
This is just a fork of eviction fence enabling code from the first
patch of the series so that the CI testing can happen on fully
fledged code.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Reviewed-by: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08 16:48:19 -04:00