Commit Graph

117677 Commits

Author SHA1 Message Date
YiPeng Chai
fd98319f73 drm/amd/ras: Add ras aca parser v1.0
Add ras aca parser v1.0.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Sunil Khatri
b1dd0db1c6 drm/amdgpu: clean up amdgpu hmm range functions
Clean up the amdgpu hmm range functions for clearer
definition of each.

a. Split amdgpu_ttm_tt_get_user_pages_done into two:
   1. amdgpu_hmm_range_valid: To check if the user pages
      are valid and update seq num
   2. amdgpu_hmm_range_free: Clean up the hmm range
      and pfn memory.

b. amdgpu_ttm_tt_get_user_pages_done and
   amdgpu_ttm_tt_discard_user_pages are similar function so remove
   discard and directly use amdgpu_hmm_range_free to clean up the
   hmm range and pfn memory.

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Sunil Khatri
e095b55155 drm/amdgpu: use user provided hmm_range buffer in amdgpu_ttm_tt_get_user_pages
update the amdgpu_ttm_tt_get_user_pages and all dependent function
along with it callers to use a user allocated hmm_range buffer instead
hmm layer allocates the buffer.

This is a need to get hmm_range pointers easily accessible
without accessing the bo and that is a requirement for the
userqueue to lock the userptrs effectively.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
079ae5118e drm/amdkfd: fix suspend/resume all calls in mes based eviction path
Suspend/resume all gangs should be done with the device lock is held.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
277bb0f83e drm/amdgpu: enable suspend/resume all for gfx 12
Suspend/resume all gangs has been available for GFX12 for a while now
so enable it.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
0ef930e1fa drm/amdgpu: fix hung reset queue array memory allocation
By design the MES will return an array result that is twice the number
of hung doorbells it can report.

i.e. if up k reported doorbells are supported, then the
second half of the array, also of length k, holds the HQD information
(type/queue/pipe) where queue 1 corresponds to index 0 and k,
queue 2 corresponds to index 1 and k + 1 etc ...

The driver will use the HDQ info to target queue/pipe reset for
hardware scheduled user compute queues.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
8745ca5efb drm/amdgpu: fix initialization of doorbell array for detect and hang
Initialized doorbells should be set to invalid rather than 0 to prevent
driver from over counting hung doorbells since it checks against the
invalid value to begin with.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
d0de79f66a drm/amdgpu: fix gfx12 mes packet status return check
GFX12 MES uses low 32 bits of status return for success (1 or 0)
and high bits for debug information if low bits are 0.

GFX11 MES doesn't do this so checking full 64-bit status return
for 1 or 0 is still valid.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-10-13 14:14:16 -04:00
Jesse.Zhang
883f309add drm/amdgpu: Fix NULL pointer dereference in VRAM logic for APU devices
Previously, APU platforms (and other scenarios with uninitialized VRAM managers)
triggered a NULL pointer dereference in `ttm_resource_manager_usage()`. The root
cause is not that the `struct ttm_resource_manager *man` pointer itself is NULL,
but that `man->bdev` (the backing device pointer within the manager) remains
uninitialized (NULL) on APUs—since APUs lack dedicated VRAM and do not fully
set up VRAM manager structures. When `ttm_resource_manager_usage()` attempts to
acquire `man->bdev->lru_lock`, it dereferences the NULL `man->bdev`, leading to
a kernel OOPS.

1. **amdgpu_cs.c**: Extend the existing bandwidth control check in
   `amdgpu_cs_get_threshold_for_moves()` to include a check for
   `ttm_resource_manager_used()`. If the manager is not used (uninitialized
   `bdev`), return 0 for migration thresholds immediately—skipping VRAM-specific
   logic that would trigger the NULL dereference.

2. **amdgpu_kms.c**: Update the `AMDGPU_INFO_VRAM_USAGE` ioctl and memory info
   reporting to use a conditional: if the manager is used, return the real VRAM
   usage; otherwise, return 0. This avoids accessing `man->bdev` when it is
   NULL.

3. **amdgpu_virt.c**: Modify the vf2pf (virtual function to physical function)
   data write path. Use `ttm_resource_manager_used()` to check validity: if the
   manager is usable, calculate `fb_usage` from VRAM usage; otherwise, set
   `fb_usage` to 0 (APUs have no discrete framebuffer to report).

This approach is more robust than APU-specific checks because it:
- Works for all scenarios where the VRAM manager is uninitialized (not just APUs),
- Aligns with TTM's design by using its native helper function,
- Preserves correct behavior for discrete GPUs (which have fully initialized
  `man->bdev` and pass the `ttm_resource_manager_used()` check).

v4: use ttm_resource_manager_used(&adev->mman.vram_mgr.manager) instead of checking the adev->gmc.is_app_apu flag (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Christian König
33cc891b56 drm/amdgpu: hide VRAM sysfs attributes on GPUs without VRAM
Otherwise accessing them can cause a crash.

Signed-off-by: Christian König <christian.koenig@amd.com>
Tested-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Sathishkumar S
74de0eaa00 drm/amdgpu: fix bit shift logic
BIT_ULL(n) sets nth bit, remove explicit shift and set the position

Fixes: a7a411e246 ("drm/amdgpu: fix shift-out-of-bounds in amdgpu_debugfs_jpeg_sched_mask_set")
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Timur Kristóf
6917112af2 drm/amd/powerplay: Fix CIK shutdown temperature
Remove extra multiplication.

CIK GPUs such as Hawaii appear to use PP_TABLE_V0 in which case
the shutdown temperature is hardcoded in smu7_init_dpm_defaults
and is already multiplied by 1000. The value was mistakenly
multiplied another time by smu7_get_thermal_temperature_range.

Fixes: 4ba082572a ("drm/amd/powerplay: export the thermal ranges of VI asics (V2)")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1676
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Gui-Dong Han
6df8e84aa6 drm/amdgpu: use atomic functions with memory barriers for vm fault info
The atomic variable vm_fault_info_updated is used to synchronize access to
adev->gmc.vm_fault_info between the interrupt handler and
get_vm_fault_info().

The default atomic functions like atomic_set() and atomic_read() do not
provide memory barriers. This allows for CPU instruction reordering,
meaning the memory accesses to vm_fault_info and the vm_fault_info_updated
flag are not guaranteed to occur in the intended order. This creates a
race condition that can lead to inconsistent or stale data being used.

The previous implementation, which used an explicit mb(), was incomplete
and inefficient. It failed to account for all potential CPU reorderings,
such as the access of vm_fault_info being reordered before the atomic_read
of the flag. This approach is also more verbose and less performant than
using the proper atomic functions with acquire/release semantics.

Fix this by switching to atomic_set_release() and atomic_read_acquire().
These functions provide the necessary acquire and release semantics,
which act as memory barriers to ensure the correct order of operations.
It is also more efficient and idiomatic than using explicit full memory
barriers.

Fixes: b97dfa27ef ("drm/amdgpu: save vm fault information for amdkfd")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Alex Deucher
ff780f4f80 drm/amdgpu: set an error on all fences from a bad context
When we backup ring contents to reemit after a queue reset,
we don't backup ring contents from the bad context.  When
we signal the fences, we should set an error on those
fences as well.

v2: misc cleanups
v3: add locking for fence error, fix comment (Christian)
v4: fix wrap around, locking (Christian)

Fixes: 77cc0da39c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Alex Deucher
1f22fcb88b drm/amdgpu: handle wrap around in reemit handling
Compare the sequence numbers directly.

Fixes: 77cc0da39c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Alex Deucher
357d90be2c drm/amdgpu: fix handling of harvesting for ip_discovery firmware
Chips which use the IP discovery firmware loaded by the driver
reported incorrect harvesting information in the ip discovery
table in sysfs because the driver only uses the ip discovery
firmware for populating sysfs and not for direct parsing for the
driver itself as such, the fields that are used to print the
harvesting info in sysfs report incorrect data for some IPs.  Populate
the relevant fields for this case as well.

Fixes: 514678da56 ("drm/amdgpu/discovery: fix fw based ip discovery")
Acked-by: Tom St Denis <tom.stdenis@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Christian König
8f74c70be5 drm/amdgpu: block CE CS if not explicitely allowed by module option
The Constant Engine found on gfx6-gfx10 HW has been a notorious source of
problems.

RADV never used it in the first place, radeonsi only used it for a few
releases around 2017 for gfx6-gfx9 before dropping support for it as
well.

While investigating another problem I just recently found that submitting
to the CE seems to be completely broken on gfx9 for quite a while.

Since nobody complained about that problem it most likely means that
nobody is using any of the affected radeonsi versions on current Linux
kernels any more.

So to potentially phase out the support for the CE and eliminate another
source of problems block submitting CE IBs unless it is enabled again
using a debug flag.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Christian König
5d55ed19d4 drm/amdgpu: remove two invalid BUG_ON()s
Those can be triggered trivially by userspace.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Timur Kristóf
7bdd91abf0 drm/amd: Disable ASPM on SI
Enabling ASPM causes randoms hangs on Tahiti and Oland on Zen4.
It's unclear if this is a platform-specific or GPU-specific issue.
Disable ASPM on SI for the time being.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Timur Kristóf
5c05bcf6ae drm/amd/pm: Disable MCLK switching on SI at high pixel clocks
On various SI GPUs, a flickering can be observed near the bottom
edge of the screen when using a single 4K 60Hz monitor over DP.
Disabling MCLK switching works around this problem.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Matthew Schwartz
9858ea4c29 Revert "drm/amd/display: Only restore backlight after amdgpu_dm_init or dm_resume"
This fix regressed the original issue that commit 7875afafba
("drm/amd/display: Fix brightness level not retained over reboot") solved,
so revert it until a different approach to solve the regression that
it caused with AMD_PRIVATE_COLOR is found.

Fixes: a490c8d77d ("drm/amd/display: Only restore backlight after amdgpu_dm_init or dm_resume")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4620
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Schwartz <matthew.schwartz@linux.dev>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Linus Torvalds
284fc30e66 Merge tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel
Pull more drm fixes from Dave Airlie:
 "Just the follow up fixes for rc1 from the next branch, amdgpu and xe
  mostly with a single v3d fix in there.

  amdgpu:
   - DC DCE6 fixes
   - GPU reset fixes
   - Secure diplay messaging cleanup
   - MES fix
   - GPUVM locking fixes
   - PMFW messaging cleanup
   - PCI US/DS switch handling fix
   - VCN queue reset fix
   - DC FPU handling fix
   - DCN 3.5 fix
   - DC mirroring fix

  amdkfd:
   - Fix kfd process ref leak
   - mmap write lock handling fix
   - Fix comments in IOCTL

  xe:
   - Fix build with clang 16
   - Fix handling of invalid configfs syntax usage and spell out the
     expected syntax in the documentation
   - Do not try late bind firmware when running as VF since it shouldn't
     handle firmware loading
   - Fix idle assertion for local BOs
   - Fix uninitialized variable for late binding
   - Do not require perfmon_capable to expose free memory at page
     granularity. Handle it like other drm drivers do
   - Fix lock handling on suspend error path
   - Fix I2C controller resume after S3

  v3d:
   - fix fence locking"

* tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel: (34 commits)
  drm/amd/display: Incorrect Mirror Cositing
  drm/amd/display: Enable Dynamic DTBCLK Switch
  drm/amdgpu: Report individual reset error
  drm/amdgpu: partially revert "revert to old status lock handling v3"
  drm/amd/display: Fix unsafe uses of kernel mode FPU
  drm/amd/pm: Disable VCN queue reset on SMU v13.0.6 due to regression
  drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine
  drm/amdgpu: Check swus/ds for switch state save
  drm/amdkfd: Fix two comments in kfd_ioctl.h
  drm/amd/pm: Avoid interface mismatch messaging
  drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init
  drm/amd/amdgpu: Fix the mes version that support inv_tlbs
  drm/amd: Check whether secure display TA loaded successfully
  drm/amdkfd: Fix mmap write lock not release
  drm/amdkfd: Fix kfd process ref leaking when userptr unmapping
  drm/amdgpu: Fix for GPU reset being blocked by KIQ I/O.
  drm/amd/display: Disable scaling on DCE6 for now
  drm/amd/display: Properly disable scaling on DCE6
  drm/amd/display: Properly clear SCL_*_FILTER_CONTROL on DCE6
  drm/amd/display: Add missing DCE6 SCL_HORZ_FILTER_INIT* SRIs
  ...
2025-10-10 14:02:14 -07:00
Linus Torvalds
1e5d41b981 Merge tag 'drm-fixes-2025-10-11' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "Some fixes leftover from our fixes branch, just nouveau and vmwgfx:

  nouveau:
   - Return errno code from TTM move helper

  vmwgfx:
   - Fix null-ptr access in cursor code
   - Fix UAF in validation
   - Use correct iterator in validation"

* tag 'drm-fixes-2025-10-11' of https://gitlab.freedesktop.org/drm/kernel:
  drm/nouveau: fix bad ret code in nouveau_bo_move_prep
  drm/vmwgfx: Fix copy-paste typo in validation
  drm/vmwgfx: Fix Use-after-free in validation
  drm/vmwgfx: Fix a null-ptr access in the cursor snooper
2025-10-10 13:59:38 -07:00
Dave Airlie
5ca5f00a16 Merge tag 'drm-misc-fixes-2025-10-09' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Short summary of fixes pull:

nouveau:
- Return errno code from TTM move helper

vmwgfx:
- Fix null-ptr access in cursor code
- Fix UAF in validation
- Use correct iterator in validation

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20251009120004.GA17570@linux.fritz.box
2025-10-11 06:17:13 +10:00
Dave Airlie
c4b6ddcf01 Merge tag 'amd-drm-next-6.18-2025-10-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.18-2025-10-09:

amdgpu:
- DC DCE6 fixes
- GPU reset fixes
- Secure diplay messaging cleanup
- MES fix
- GPUVM locking fixes
- PMFW messaging cleanup
- PCI US/DS switch handling fix
- VCN queue reset fix
- DC FPU handling fix
- DCN 3.5 fix
- DC mirroring fix

amdkfd:
- Fix kfd process ref leak
- mmap write lock handling fix
- Fix comments in IOCTL

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20251009162915.981503-1-alexander.deucher@amd.com
2025-10-10 06:57:56 +10:00
Shuhao Fu
e4bea91958 drm/nouveau: fix bad ret code in nouveau_bo_move_prep
In `nouveau_bo_move_prep`, if `nouveau_mem_map` fails, an error code
should be returned. Currently, it returns zero even if vmm addr is not
correctly mapped.

Cc: stable@vger.kernel.org
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Fixes: 9ce523cc3b ("drm/nouveau: separate buffer object backing memory from nvkm structures")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2025-10-07 22:15:25 +02:00
Jesse Agate
d07e142641 drm/amd/display: Incorrect Mirror Cositing
[WHY]
hinit/vinit are incorrect in the case of mirroring.

[HOW]
Cositing sign must be flipped when image is mirrored in the vertical
or horizontal direction.

Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Samson Tam <samson.tam@amd.com>
Signed-off-by: Jesse Agate <jesse.agate@amd.com>
Signed-off-by: Brendan Leder <breleder@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:20 -04:00
Fangzhi Zuo
5949e7c489 drm/amd/display: Enable Dynamic DTBCLK Switch
[WHAT]
Since dcn35, DTBCLK can be disabled when no DP2 sink connected for
power saving purpose.

Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Fangzhi Zuo <Jerry.Zuo@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Lijo Lazar
2e97663760 drm/amdgpu: Report individual reset error
If reinitialization of one of the GPUs fails after reset, it logs
failure on all subsequent GPUs eventhough they have resumed
successfully.

A sample log where only device at 0000:95:00.0 had a failure -

	amdgpu 0000:15:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:65:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:75:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:85:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:95:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:e5:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:f5:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:05:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:15:00.0: amdgpu: GPU reset end with ret = -5

To avoid confusion, report the error for each device
separately and return the first error as the overall result.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Christian König
a107aeb6a2 drm/amdgpu: partially revert "revert to old status lock handling v3"
The CI systems are pointing out list corruptions, so we still need to
fix something here.

Keep the asserts, but revert the lock changes for now.

Fixes: 59e4405e9e ("drm/amdgpu: revert to old status lock handling v3")
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Ard Biesheuvel
ddbfac1528 drm/amd/display: Fix unsafe uses of kernel mode FPU
The point of isolating code that uses kernel mode FPU in separate
compilation units is to ensure that even implicit uses of, e.g., SIMD
registers for spilling occur only in a context where this is permitted,
i.e., from inside a kernel_fpu_begin/end block.

This is important on arm64, which uses -mgeneral-regs-only to build all
kernel code, with the exception of such compilation units where FP or
SIMD registers are expected to be used. Given that the compiler may
invent uses of FP/SIMD anywhere in such a unit, none of its code may be
accessible from outside a kernel_fpu_begin/end block.

This means that all callers into such compilation units must use the
DC_FP start/end macros, which must not occur there themselves. For
robustness, all functions with external linkage that reside there should
call dc_assert_fp_enabled() to assert that the FPU context was set up
correctly.

Fix this for the DCN35, DCN351 and DCN36 implementations.

Cc: Austin Zheng <austin.zheng@amd.com>
Cc: Jun Lei <jun.lei@amd.com>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Leo Li <sunpeng.li@amd.com>
Cc: Rodrigo Siqueira <siqueira@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Jesse.Zhang
bd8acfcfce drm/amd/pm: Disable VCN queue reset on SMU v13.0.6 due to regression
Disable VCN reset capability for the program 4 as it's
causing regressions.

Fixes: 9d20f37a10 ("drm/amd/pm: Add VCN reset support for SMU v13.0.6")
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Jesse.Zhang
8d557eab3a drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine
After GPU reset with VRAM loss, a general protection fault occurs
during user queue restoration when accessing vm_bo->vm after
spinlock release in amdgpu_vm_bo_reset_state_machine.

The root cause is that vm_bo points to the last entry from the
list_for_each_entry loop, but this becomes invalid after the
spinlock is released. Accessing vm_bo->vm at this point leads
to memory corruption.

Crash log shows:
[  326.981811] Oops: general protection fault, probably for non-canonical address 0x4156415741e58ac8: 0000 [#1] SMP NOPTI
[  326.981820] CPU: 13 UID: 0 PID: 1035 Comm: kworker/13:3 Tainted: G            E       6.16.0+ #25 PREEMPT(voluntary)
[  326.981826] Tainted: [E]=UNSIGNED_MODULE
[  326.981827] Hardware name: Gigabyte Technology Co., Ltd. X870E AORUS PRO ICE/X870E AORUS PRO ICE, BIOS F3i 12/19/2024
[  326.981831] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
[  326.981999] RIP: 0010:amdgpu_vm_assert_locked+0x16/0x70 [amdgpu]
[  326.982094] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 45 48 8b 87 80 03 00 00 48 85 c0 74 40 <48> 8b b8 80 01 00 00 48 85 ff 74 3b 8b 05 0c b7 0e f0 85 c0 75 05
[  326.982098] RSP: 0018:ffffaa91c2a6bc20 EFLAGS: 00010206
[  326.982100] RAX: 4156415741e58948 RBX: ffff9e8f013e8330 RCX: 0000000000000000
[  326.982102] RDX: 0000000000000005 RSI: 000000001d254e88 RDI: ffffffffc144814a
[  326.982104] RBP: ffffaa91c2a6bc68 R08: 0000004c21a25674 R09: 0000000000000001
[  326.982106] R10: 0000000000000001 R11: dccaf3f2f82863fc R12: ffff9e8f013e8000
[  326.982108] R13: ffff9e8f013e8000 R14: 0000000000000000 R15: ffff9e8f09980000
[  326.982110] FS:  0000000000000000(0000) GS:ffff9e9e79995000(0000) knlGS:0000000000000000
[  326.982112] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  326.982114] CR2: 000055ed6c9caa80 CR3: 0000000797060000 CR4: 0000000000750ef0
[  326.982116] PKRU: 55555554

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Lijo Lazar
9b608fe948 drm/amdgpu: Check swus/ds for switch state save
For saving switch state, check if the GPU is having SWUS/DS
architecture. Otherwise, skip saving.

Reported-by: Roman Elshin <roman.elshin@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4602
Fixes: 1dd2fa0e00 ("drm/amdgpu: Save and restore switch state")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Lijo Lazar
4538a93bbb drm/amd/pm: Avoid interface mismatch messaging
PMFW interface version is not used by some IP implementations like SMU
v13.0.6/12, instead rely on PMFW version checks. Avoid the log if
interface version is not used.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Jesse.Zhang
b809ca91a5 drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init
As KFD no longer uses a separate PASID, the global amdgpu_vm_set_pasid()function is no longer necessary.
Merge its functionality directly intoamdgpu_vm_init() to simplify code flow and eliminate redundant locking.

v2: remove superflous check
  adjust amdgpu_vm_fin and remove amdgpu_vm_set_pasid (Chritian)

v3: drop amdgpu_vm_assert_locked (Chritian)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4614
Fixes: 59e4405e9e ("drm/amdgpu: revert to old status lock handling v3")
Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:07 -04:00
Shaoyun Liu
8dbac5cf8b drm/amd/amdgpu: Fix the mes version that support inv_tlbs
MES version 0x83 is not stable to use the inv_tlbs API. Defer it to 0x84 vertsion.

Fixes: 85442bac84 ("drm/amd/amdgpu: Fix the mes version that support inv_tlbs")
Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Mario Limonciello
c760bcda83 drm/amd: Check whether secure display TA loaded successfully
[Why]
Not all renoir hardware supports secure display.  If the TA is present
but the feature isn't supported it will fail to load or send commands.
This shows ERR messages to the user that make it seems like there is
a problem.

[How]
Check the resp_status of the context to see if there was an error
before trying to send any secure display commands.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1415
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Philip Yang
7574f30337 drm/amdkfd: Fix mmap write lock not release
If mmap write lock is taken while draining retry fault, mmap write lock
is not released because svm_range_restore_pages calls mmap_read_unlock
then returns. This causes deadlock and system hangs later because mmap
read or write lock cannot be taken.

Downgrade mmap write lock to read lock if draining retry fault fix this
bug.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Philip Yang
58e6fc2fb9 drm/amdkfd: Fix kfd process ref leaking when userptr unmapping
kfd_lookup_process_by_pid hold the kfd process reference to ensure it
doesn't get destroyed while sending the segfault event to user space.

Calling kfd_lookup_process_by_pid as function parameter leaks the kfd
process refcount and miss the NULL pointer check if app process is
already destroyed.

Fixes: 2d274bf709 ("amd/amdkfd: Trigger segfault for early userptr unmmapping")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Heng Zhou
0c67342885 drm/amdgpu: Fix for GPU reset being blocked by KIQ I/O.
There is some probability that reset workqueue is blocked by KIQ I/O for 10+ seconds after gpu hangs.
So we need to add a in_reset check during each KIQ register poll.

Signed-off-by: Heng Zhou <Heng.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Timur Kristóf
0e190a0446 drm/amd/display: Disable scaling on DCE6 for now
Scaling doesn't work on DCE6 at the moment, the current
register programming produces incorrect output when using
fractional scaling (between 100-200%) on resolutions higher
than 1080p.

Disable it until we figure out how to program it properly.

Fixes: 7c15fd86aa ("drm/amd/display: dc/dce: add initial DCE6 support (v10)")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Timur Kristóf
a7dc87f344 drm/amd/display: Properly disable scaling on DCE6
SCL_SCALER_ENABLE can be used to enable/disable the scaler
on DCE6. Program it to 0 when scaling isn't used, 1 when used.
Additionally, clear some other registers when scaling is
disabled and program the SCL_UPDATE register as recommended.

This fixes visible glitches for users whose BIOS sets up a
mode with scaling at boot, which DC was unable to clean up.

Fixes: b70aaf5586 ("drm/amd/display: dce_transform: add DCE6 specific macros,functions")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Timur Kristóf
c0aa7cf49d drm/amd/display: Properly clear SCL_*_FILTER_CONTROL on DCE6
Previously, the code would set a bit field which didn't exist
on DCE6 so it would be effectively a no-op.

Fixes: b70aaf5586 ("drm/amd/display: dce_transform: add DCE6 specific macros,functions")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Timur Kristóf
d60f9c45d1 drm/amd/display: Add missing DCE6 SCL_HORZ_FILTER_INIT* SRIs
Without these, it's impossible to program these registers.

Fixes: 102b2f587a ("drm/amd/display: dce_transform: DCE6 Scaling Horizontal Filter Init (v2)")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Alex Deucher
507296328b drm/amdgpu: Add additional DCE6 SCL registers
Fixes: 102b2f587a ("drm/amd/display: dce_transform: DCE6 Scaling Horizontal Filter Init (v2)")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Linus Torvalds
2215336295 Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Unify guest entry code for KVM and MSHV (Sean Christopherson)

 - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain()
   (Nam Cao)

 - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV
   (Mukesh Rathor)

 - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)

 - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna
   Kumar T S M)

 - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari,
   Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)

* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hyperv: Remove the spurious null directive line
  MAINTAINERS: Mark hyperv_fb driver Obsolete
  fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
  Drivers: hv: Make CONFIG_HYPERV bool
  Drivers: hv: Add CONFIG_HYPERV_VMBUS option
  Drivers: hv: vmbus: Fix typos in vmbus_drv.c
  Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
  Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
  x86/hyperv: Switch to msi_create_parent_irq_domain()
  mshv: Use common "entry virt" APIs to do work in root before running guest
  entry: Rename "kvm" entry code assets to "virt" to genericize APIs
  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
  mshv: Handle NEED_RESCHED_LAZY before transferring to guest
  x86/hyperv: Add kexec/kdump support on Azure CVMs
  Drivers: hv: Simplify data structures for VMBus channel close message
  Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
  mshv: Add support for a new parent partition configuration
  clocksource: hyper-v: Skip unnecessary checks for the root partition
  hyperv: Add missing field to hv_output_map_device_interrupt
2025-10-07 08:40:15 -07:00
Dave Airlie
73bc073d42 Merge tag 'drm-xe-next-fixes-2025-10-03' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
Cross-subsystem Changes:
 - Fix userptr to not allow device private pages with SVM (Thomas
   Hellström)

Driver Changes:
 - Fix build with clang 16 (Michal Wajdeczko)
 - Fix handling of invalid configfs syntax usage and spell out the
   expected syntax in the documentation (Lucas De Marchi)
 - Do not try late bind firmware when running as VF since it
   shouldn't handle firmware loading (Michal Wajdeczko)
 - Fix idle assertion for local BOs (Thomas Hellström)
 - Fix uninitialized variable for late binding (Colin Ian King,
   Mallesh Koujalagi)
 - Do not require perfmon_capable to expose free memory at page
   granularity. Handle it like other drm drivers do (Matthew Auld)
 - Fix lock handling on suspend error path (Shuicheng Lin)
 - Fix I2C controller resume after S3 (Raag Jadav)

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/q6yeyb7n2eqo5megxjqayooajirx5hhsntfo65m3y4myscz7oz@25qbabbbr4hj
2025-10-07 11:50:11 +10:00
Dave Airlie
bae04c9658 Merge tag 'drm-misc-next-fixes-2025-10-02' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
Short summary of fixes pull:

v3d:
- Fix fence locking

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20251002122303.GA21323@linux.fritz.box
2025-10-07 11:35:57 +10:00
Ian Forbes
228c5d44df drm/vmwgfx: Fix copy-paste typo in validation
'entry' should be 'val' which is the loop iterator.

Fixes: 9e931f2e09 ("drm/vmwgfx: Refactor resource validation hashtable to use linux/hashtable implementation.")
Signed-off-by: Ian Forbes <ian.forbes@broadcom.com>
Reviewed-by: Zack Rusin <zack.rusin@broadcom.com>
Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
Link: https://lore.kernel.org/r/20250926195427.1405237-2-ian.forbes@broadcom.com
2025-10-06 11:57:59 -04:00