linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-16 23:02:27 -04:00

Author	SHA1	Message	Date
Shubhankar Milind Sardeshpande	47776ac1e3	drm/amdgpu: Avoid reset in AMDGPU unload path for APUs with GFX V11 and higher. GFX V11 has GC block as default off IP. Every time AMDGPU driver sends a request to PMFW to unload MP1, PMFW will put GC in reset and power down the voltage.Hence, skipping reset for APUs with GFX V11 or later to avoid reset related failures. Fixes: `34355e6183` ("drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Shubhankar Milind Sardeshpande <Shubhankar.MilindSardeshpande@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit d0a8cadffc818f51d05bc234d8da1af228bc59a3) Cc: stable@vger.kernel.org	2026-04-24 11:10:44 -04:00
Mario Limonciello	0e48f27d13	drm/amd: Adjust ASPM support quirk to cover more Intel hosts Some of the same issues identified in commit `c770ef1967` ("drm/amd/amdgpu: disable ASPM in some situations") also affect Tiger Lake systems with GFX11 connected over USB4. Widen the net to also match these hosts. Fixes: `d9b3a066df` ("drm/amd: Exclude dGPUs in eGPU enclosures from DPM quirks") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5145 Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 0a214d888485b9f35fe03882a92962e6d5697849)	2026-04-21 17:03:01 -04:00
Ce Sun	169a0556d9	drm/amdgpu: correct single device PCIe reset flow for DPC For triggering the dpc event with a single device, we still need to set the in_link_reset flag and the dpc status. Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-17 14:49:11 -04:00
Prike Liang	34f31fe40f	drm/amdgpu: rework userq fence driver alloc/destroy The correct fix is to tie the global xa entry lifetime to the queue lifetime: insert in amdgpu_userq_create() and erase in amdgpu_userq_cleanup(), both at the well-defined doorbell_index key, making the operation O(1) and resolve the fence driver UAF problem by binding the userq driver fence to per queue. v2: clean up the local variables initialization. (Christian) Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:59:28 -04:00
Marco Crivellari	53140a0d59	drm/amdgpu: replace use of system_wq with system_dfl_wq This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen after a careful review and conversion of each individual case, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. Suggested-by: Tejun Heo <tj@kernel.org> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:52:29 -04:00
Marco Crivellari	7f0fc00368	drm/amdgpu: replace use of system_unbound_wq with system_dfl_wq This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen after a careful review and conversion of each individual case, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. Suggested-by: Tejun Heo <tj@kernel.org> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-04-03 13:52:12 -04:00
Jesse Zhang	bf89091035	drm/amdgpu: flush coredump work before HW teardown In amdgpu_device_fini_hw(), deferred coredump formatting work may still be pending when hardware and IP components are being torn down. Since the work may access device registers and memory that will be freed or powered off, it must be completed before proceeding. Add a flush_work() call for adev->coredump_work, guarded by CONFIG_DEV_COREDUMP, to ensure any pending coredump work finishes before the device enters the early IP fini stage. This avoids potential use-after-free or accessing hardware resources that are no longer available. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-30 14:32:12 -04:00
Ruijing Dong	94d79f51ef	drm/amdgpu: fix strsep() corrupting lockup_timeout on multi-GPU (v3) amdgpu_device_get_job_timeout_settings() passes a pointer directly to the global amdgpu_lockup_timeout[] buffer into strsep(). strsep() destructively replaces delimiter characters with '\0' in-place. On multi-GPU systems, this function is called once per device. When a multi-value setting like "0,0,0,-1" is used, the first GPU's call transforms the global buffer into "0\00\00\0-1". The second GPU then sees only "0" (terminated at the first '\0'), parses a single value, hits the single-value fallthrough (index == 1), and applies timeout=0 to all rings — causing immediate false job timeouts. Fix this by copying into a stack-local array before calling strsep(), so the global module parameter buffer remains intact across calls. The buffer is AMDGPU_MAX_TIMEOUT_PARAM_LENGTH (256) bytes, which is safe for the stack. v2: wrap commit message to 72 columns, add Assisted-by tag. v3: use stack array with strscpy() instead of kstrdup()/kfree() to avoid unnecessary heap allocation (Christian). This patch was developed with assistance from Claude (claude-opus-4-6). Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Ruijing Dong <ruijing.dong@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-23 14:09:28 -04:00
Pierre-Eric Pelloux-Prayer	4bbba79a7f	drm/amdgpu: move devcoredump generation to a worker Update the way drm_coredump_printer is used based on its documentation and Xe's code: the main idea is to generate the final version in one go and then use memcpy to return the chunks requested by the caller of amdgpu_devcoredump_read. The generation is moved to a separate worker thread. This cuts the time to copy the dump from 40s to ~0s on my machine. --- v3: - removed adev->coredump_in_progress and instead use work as the synchronisation mechanism - use kvfree instead of kfree --- Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:45:20 -04:00
Hawking Zhang	be3f235bb6	drm/amdgpu: Query xgmi info from mmhub if available Query xgmi info from mmhub if available Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-17 10:32:38 -04:00
Mario Limonciello	fae5984296	drm/amd: Set num IP blocks to 0 if discovery fails If discovery has failed for any reason (such as no support for a block) then there is no need to unwind all the IP blocks in fini. In this condition there can actually be failures during the unwind too. Reset num_ip_blocks to zero during failure path and skip the unnecessary cleanup path. Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-11 13:58:08 -04:00
Mario Limonciello	f5a05f8414	drm/amd: Fix a few more NULL pointer dereference in device cleanup I found a few more paths that cleanup fails due to a NULL version pointer on unsupported hardware. Add NULL checks as applicable. Fixes: `39fc2bc4da` ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:31:54 -05:00
Mario Limonciello	b7ac77468c	drm/amd: Fix NULL pointer dereference in device cleanup When GPU initialization fails due to an unsupported HW block IP blocks may have a NULL version pointer. During cleanup in amdgpu_device_fini_hw, the code calls amdgpu_device_set_pg_state and amdgpu_device_set_cg_state which iterate over all IP blocks and access adev->ip_blocks[i].version without NULL checks, leading to a kernel NULL pointer dereference. Add NULL checks for adev->ip_blocks[i].version in both amdgpu_device_set_cg_state and amdgpu_device_set_pg_state to prevent dereferencing NULL pointers during GPU teardown when initialization has failed. Fixes: `39fc2bc4da` ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-06 16:25:44 -05:00
Lijo Lazar	b2d5512471	drm/amdgpu: Move pcie lock to register block Move pcie register access lock to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:41 -05:00
Lijo Lazar	4a6ab03731	drm/amdgpu: Add pcie64 extended to register block Add extended pcie 64-bit access method to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:38 -05:00
Lijo Lazar	74b9c49e6d	drm/amdgpu: Add pcie64 indirect to register block Move 64-bit pcie indirect read/writes to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:34 -05:00
Lijo Lazar	5312d68a6b	drm/amdgpu: Add pcie ext access to register block Move pcie extended access (64-bit address) to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:31 -05:00
Lijo Lazar	e84d7e717b	drm/amdgpu: Add pcie indirect to register block Move pcie indirect access to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:28 -05:00
Lijo Lazar	5d82f451b0	drm/amdgpu: Add pciep method to register block Move pcie port method to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:19 -05:00
Lijo Lazar	72cc2e3011	drm/amdgpu: Add audio method to register block Move audio endpoint callbacks to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:16 -05:00
Lijo Lazar	b1a516a5df	drm/amdgpu: Add se cac method to register block Move se cac access callbacks to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:13 -05:00
Lijo Lazar	d2de787f2d	drm/amdgpu: Add gc cac method to register block Move gc cac access callbacks to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:06 -05:00
Lijo Lazar	4780a26a96	drm/amdgpu: Add didt method to register block Move didt callbacks to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:46:03 -05:00
Lijo Lazar	366201e790	drm/amdgpu: Add uvd indirect to register block Add uvd indirect method to register access block and replace the existing calls from adev. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:43:10 -05:00
Lijo Lazar	f4eb08f8b2	drm/amdgpu: Add smc method to register block Define register access block which consolidates different register access methods. Add smc method to register access block. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:42:38 -05:00
Lijo Lazar	57052d293e	drm/amdgpu: Move register access functions Move register access methods from amdgpu_device.c to a dedicated file. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-03-02 16:35:23 -05:00
Lijo Lazar	bb71362182	drm/amdgpu: Fix error handling in slot reset If the device has not recovered after slot reset is called, it goes to out label for error handling. There it could make decision based on uninitialized hive pointer and could result in accessing an uninitialized list. Initialize the list and hive properly so that it handles the error situation and also releases the reset domain lock which is acquired during error_detected callback. Fixes: `732c6cefc1` ("drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ce Sun <cesun102@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-25 16:56:33 -05:00
Timur Kristóf	c81f35cdba	drm/amdgpu: Use DC by default on CIK APUs Now that DC supports external DP bridge encoders, it has reached feature parity with the legacy non-DC display driver on CIK APUs: Kaveri, Kabini, Mullins. Use the DC display driver by default on SI APUs, unless it is explicitly disabled using the amdgpu.dc=0 module parameter. DC brings proper support for DP/HDMI audio, DP MST, VRR, 10-bit colors, some HDR features, atomic modesetting, etc. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-23 14:28:32 -05:00
Alex Deucher	01fb6e8568	drm/amdgpu: don't call drm_sched_stop/start() in asic reset We only want to stop the work queues, not mess with the fences, etc. v2: add the job back to the pending list. v3: return the proper job status so scheduler adds the job back to the pending list Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-23 14:16:31 -05:00
Pierre-Eric Pelloux-Prayer	79a0ff59a5	drm/amdgpu: move sched status check inside amdgpu_ttm_set_buffer_funcs_status It avoids duplicated code and allows to output a warning. --- v4: move check inside the existing if (enable) test --- Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-23 14:16:30 -05:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Harish Kasiviswanathan	23c098b5fc	drm/amdgpu: Set atomics to true for xgmi xgmi support atomics between links. Set them to true. This only set for GFX12 onwards to avoid regression on older generations v2: Use correct xgmi flag that indicates CPU connection Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-19 12:16:11 -05:00
Kenneth Feng	d76c66c6ad	drm/amd/pm: send unload command to smu during modprobe -r amdgpu Send unload command to smu during modprobe -r amdgpu for smu 13/14. 1. This can fix the high voltage/temperatue issue after driver is unloaded. 2. Reloading driver could fail but with the debug port based mode1 reset during driver is reloaded, it is good and safe. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-12 15:20:07 -05:00
Mario Limonciello	f7afda7fcd	drm/amd: Fix hang on amdgpu unload by using pci_dev_is_disconnected() The commit `6a23e7b433` ("drm/amd: Clean up kfd node on surprise disconnect") introduced early KFD cleanup when drm_dev_is_unplugged() returns true. However, this causes hangs during normal module unload (rmmod amdgpu). The issue occurs because drm_dev_unplug() is called in amdgpu_pci_remove() for all removal scenarios, not just surprise disconnects. This was done intentionally in commit `39934d3ed5` ("Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"") to fix IGT PCI software unplug test failures. As a result, drm_dev_is_unplugged() returns true even during normal module unload, triggering the early KFD cleanup inappropriately. The correct check should distinguish between: - Actual surprise disconnect (eGPU unplugged): pci_dev_is_disconnected() returns true - Normal module unload (rmmod): pci_dev_is_disconnected() returns false Replace drm_dev_is_unplugged() with pci_dev_is_disconnected() to ensure the early cleanup only happens during true hardware disconnect events. Cc: stable@vger.kernel.org Reported-by: Cal Peake <cp@absolutedigital.net> Closes: https://lore.kernel.org/all/b0c22deb-c0fa-3343-33cf-fd9a77d7db99@absolutedigital.net/ Fixes: `6a23e7b433` ("drm/amd: Clean up kfd node on surprise disconnect") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-05 17:25:57 -05:00
Yifan Zhang	39fc2bc4da	drm/amdgpu: Protect GPU register accesses in powergated state in some paths Ungate GPU CG/PG in device_fini_hw and device_halt to protect GPU register accesses, e.g. GC registers are accessed in amdgpu_irq_disable_all() and amdgpu_fence_driver_hw_fini(). Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2026-02-05 17:20:33 -05:00
Perry Yuan	31b153315b	drm/amdgpu: ensure no_hw_access is visible before MMIO Add a full memory barrier after clearing no_hw_access in amdgpu_device_mode1_reset() so subsequent PCI state restore access cannot observe stale state on other CPUs. Fixes: `7edb503fe4` ("drm/amd/pm: Disable MMIO access during SMU Mode 1 reset") Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-02-03 16:23:11 -05:00
Dave Airlie	6704d98a4f	BackMerge tag 'v6.19-rc7' into drm-next Linux 6.19-rc7 This is needed for msm and rust trees. Signed-off-by: Dave Airlie <airlied@redhat.com>	2026-01-28 12:44:28 +10:00
Dave Airlie	c098b1aa2f	Merge tag 'amd-drm-next-6.20-2026-01-16' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.20-2026-01-16: amdgpu: - SR-IOV fixes - Rework SMU mailbox handling - Drop MMIO_REMAP domain - UserQ fixes - MES cleanups - Panel Replay updates - HDMI fixes - Backlight fixes - SMU 14.x fixes - SMU 15 updates amdkfd: - Fix a memory leak - Fixes for systems with non-4K pages - LDS/Scratch cleanup - MES process eviction fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260116202609.23107-1-alexander.deucher@amd.com	2026-01-19 06:54:46 +10:00
Dave Airlie	83dc0ba275	Merge tag 'amd-drm-next-6.20-2026-01-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.20-2026-01-09: amdgpu: - GPUVM updates - Initial support for larger GPU address spaces - Initial SMUIO 15.x support - Documentation updates - Initial PSP 15.x support - Initial IH 7.1 support - Initial IH 6.1.1 support - SMU 13.0.12 updates - RAS updates - Initial MMHUB 3.4 support - Initial MMHUB 4.2 support - Initial GC 12.1 support - Initial GC 11.5.4 support - HDMI fixes - Panel replay improvements - DML updates - DC FP fixes - Initial SDMA 6.1.4 support - Initial SDMA 7.1 support - Userq updates - DC HPD refactor - SwSMU cleanups and refactoring - TTM memory ops parallelization - DCN 3.5 fixes - DP audio fixes - Clang fixes - Misc spelling fixes and cleanups - Initial SDMA 7.11.4 support - Convert legacy DRM logging helpers to new drm logging helpers - Initial JPEG 5.3 support - Add support for changing UMA size via the driver - DC analog fixes - GC 9 gfx queue reset support - Initial SMU 15.x support amdkfd: - Reserved SDMA rework - Refactor SPM - Initial GC 12.1 support - Initial GC 11.5.4 support - Initial SDMA 7.1 support - Initial SDMA 6.1.4 support - Increase the kfd process hash table - Per context support - Topology fixes radeon: - Convert legacy DRM logging helpers to new drm logging helpers - Use devm for i2c adapters - Variable sized array fix - Misc cleanups UAPI: - KFD context support. Proposed userspace: https://github.com/ROCm/rocm-systems/pull/1705 https://github.com/ROCm/rocm-systems/pull/1701 - Add userq metadata queries for more queue types. Proposed userspace: https://gitlab.freedesktop.org/yogeshmohan/mesa/-/commits/userq_query From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260109154713.3242957-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>	2026-01-15 14:49:33 +10:00
Mario Limonciello (AMD)	28695ca09d	drm/amd: Clean up kfd node on surprise disconnect When an eGPU is unplugged the KFD topology should also be destroyed for that GPU. This never happens because the fini_sw callbacks never get to run. Run them manually before calling amdgpu_device_ip_fini_early() when a device has already been disconnected. This location is intentionally chosen to make sure that the kfd locking refcount doesn't get incremented unintentionally. Cc: kent.russell@amd.com Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33 Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `6a23e7b433`) Cc: stable@vger.kernel.org	2026-01-14 14:51:36 -05:00
Xiaogang Chen	6cca686dfc	drm/amdkfd: kfd driver supports hot unplug/replug amdgpu devices This patch allows kfd driver function correctly when AMD gpu devices got unplug/replug at run time. When an AMD gpu device got unplug kfd driver gracefully terminates existing kfd processes after stops all queues by sending SIGBUS to user process. After that user space can still use remaining AMD gpu devices. When all AMD gpu devices at system got removed kfd driver will not response new requests. Unplugged AMD gpu devices can be re-plugged. kfd driver will use added devices to function as usual. The purpose of this patch is having kfd driver behavior as expected during and after AMD gpu devices unplug/replug at run time. Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-14 14:28:49 -05:00
Mario Limonciello (AMD)	6a23e7b433	drm/amd: Clean up kfd node on surprise disconnect When an eGPU is unplugged the KFD topology should also be destroyed for that GPU. This never happens because the fini_sw callbacks never get to run. Run them manually before calling amdgpu_device_ip_fini_early() when a device has already been disconnected. This location is intentionally chosen to make sure that the kfd locking refcount doesn't get incremented unintentionally. Cc: kent.russell@amd.com Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33 Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-08 11:43:23 -05:00
Mario Limonciello (AMD)	6b2989ac5e	Reapply "Revert "drm/amd: Skip power ungate during suspend for VPE"" Skipping power ungate exposed some scenarios that will fail like below: ``` amdgpu: Register(0) [regVPEC_QUEUE_RESET_REQ] failed to reach value 0x00000000 != 0x00000001n amdgpu 0000:c1:00.0: amdgpu: VPE queue reset failed ... amdgpu: [drm] ERROR wait_for_completion_timeout timeout! ``` The underlying s2idle issue that prompted this commit is going to be fixed in BIOS. This reverts commit `2a6c826cfe`. This was lost in the 6.19 merge so reapply it. Fixes: `2a6c826cfe` ("drm/amd: Skip power ungate during suspend for VPE") Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reported-by: Konstantin <answer2019@yandex.ru> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220812 Reported-by: Matthew Schwartz <matthew.schwartz@linux.dev> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `3925683515`)	2026-01-07 17:24:16 -05:00
Perry Yuan	0de604d035	drm/amd/pm: Disable MMIO access during SMU Mode 1 reset During Mode 1 reset, the ASIC undergoes a reset cycle and becomes temporarily inaccessible via PCIe. Any attempt to access MMIO registers during this window (e.g., from interrupt handlers or other driver threads) can result in uncompleted PCIe transactions, leading to NMI panics or system hangs. To prevent this, set the `no_hw_access` flag to true immediately after triggering the reset. This signals other driver components to skip register accesses while the device is offline. A memory barrier `smp_mb()` is added to ensure the flag update is globally visible to all cores before the driver enters the sleep/wait state. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `7edb503fe4`)	2026-01-07 17:24:10 -05:00
Perry Yuan	7edb503fe4	drm/amd/pm: Disable MMIO access during SMU Mode 1 reset During Mode 1 reset, the ASIC undergoes a reset cycle and becomes temporarily inaccessible via PCIe. Any attempt to access MMIO registers during this window (e.g., from interrupt handlers or other driver threads) can result in uncompleted PCIe transactions, leading to NMI panics or system hangs. To prevent this, set the `no_hw_access` flag to true immediately after triggering the reset. This signals other driver components to skip register accesses while the device is offline. A memory barrier `smp_mb()` is added to ensure the flag update is globally visible to all cores before the driver enters the sleep/wait state. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-05 17:00:00 -05:00
Alex Deucher	b4ba5c9509	drm/amdgpu: use dma_fence_get_status() for adapter reset We need to check if the fence was signaled without an error as the per queue resets may have signalled the fence while attempting to reset the queue. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-05 16:59:58 -05:00
Yo-Jung Leo Lin (AMD)	5946dbe1c8	Documentation/amdgpu: Add UMA carveout details Add documentation for the uma/carveout_options and uma/carveout attributes in sysfs Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Yo-Jung Leo Lin (AMD) <Leo.Lin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-05 16:59:58 -05:00
Yo-Jung Leo Lin (AMD)	19ba61ac06	drm/amdgpu: add UMA allocation interfaces to sysfs Add a uma/ directory containing two sysfs files as interfaces to inspect or change UMA carveout size. These files are: - uma/carveout_options: a read-only file listing all the available UMA allocation options and their index. - uma/carveout: a file that is both readable and writable. On read, it shows the index of the current setting. Writing a valid index into this file allows users to change the UMA carveout size to that option on the next boot. Co-developed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Yo-Jung Leo Lin (AMD) <Leo.Lin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-05 16:59:58 -05:00
Mario Limonciello (AMD)	e291729873	drm/amd: Convert DRM_() to drm_() The drm_*() macros include the device which is helpful for debugging issues in multi-GPU systems. Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2026-01-05 16:59:55 -05:00

1 2 3 4 5 ...

1604 Commits