There is no reason to require this to happen on first submitted IB only.
We need to wait for the queue to be idle, but it can be done at any
time (including when there are multiple video sessions active).
Signed-off-by: David Rosca <david.rosca@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8908fdce06)
Cc: stable@vger.kernel.org
There can be multiple engine info packages in one IB and the first one
may be common engine, not decode/encode.
We need to parse the entire IB instead of stopping after finding first
engine info.
Signed-off-by: David Rosca <david.rosca@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dc8f9f0f45)
Cc: stable@vger.kernel.org
Commit b61badd20b ("drm/amdgpu: fix usage slab after free")
reordered when amdgpu_fence_driver_sw_fini() was called after
that patch, amdgpu_fence_driver_sw_fini() effectively became
a no-op as the sched entities we never freed because the
ring pointers were already set to NULL. Remove the NULL
setting.
Reported-by: Lin.Cao <lincao12@amd.com>
Cc: Vitaly Prosyak <vitaly.prosyak@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Fixes: b61badd20b ("drm/amdgpu: fix usage slab after free")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a525fa37aa)
Cc: stable@vger.kernel.org
When the xe pm_notifier evicts for suspend / hibernate, there might be
racing tasks trying to re-validate again. This can lead to suspend taking
excessive time or get stuck in a live-lock. This behaviour becomes
much worse with the fix that actually makes re-validation bring back
bos to VRAM rather than letting them remain in TT.
Prevent that by having exec and the rebind worker waiting for a completion
that is set to block by the pm_notifier before suspend and is signaled
by the pm_notifier after resume / wakeup.
It's probably still possible to craft malicious applications that block
suspending. More work is pending to fix that.
v3:
- Avoid wait_for_completion() in the kernel worker since it could
potentially cause work item flushes from freezable processes to
wait forever. Instead terminate the rebind workers if needed and
re-launch at resume. (Matt Auld)
v4:
- Fix some bad naming and leftover debug printouts.
- Fix kerneldoc.
- Use drmm_mutex_init() for the xe->rebind_resume_lock (Matt Auld).
- Rework the interface of xe_vm_rebind_resume_worker (Matt Auld).
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4288
Fixes: c6a4d46ec1 ("drm/xe: evict user memory in PM notifier")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250904160715.2613-4-thomas.hellstrom@linux.intel.com
(cherry picked from commit 599334572a)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
VRAM+TT bos that are evicted from VRAM to TT may remain in
TT also after a revalidation following eviction or suspend.
This manifests itself as applications becoming sluggish
after buffer objects get evicted or after a resume from
suspend or hibernation.
If the bo supports placement in both VRAM and TT, and
we are on DGFX, mark the TT placement as fallback. This means
that it is tried only after VRAM + eviction.
This flaw has probably been present since the xe module was
upstreamed but use a Fixes: commit below where backporting is
likely to be simple. For earlier versions we need to open-
code the fallback algorithm in the driver.
v2:
- Remove check for dgfx. (Matthew Auld)
- Update the xe_dma_buf kunit test for the new strategy (CI)
- Allow dma-buf to pin in current placement (CI)
- Make xe_bo_validate() for pinned bos a NOP.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995
Fixes: a78a8da51b ("drm/ttm: replace busy placement with flags v6")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: <stable@vger.kernel.org> # v6.9+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250904160715.2613-2-thomas.hellstrom@linux.intel.com
(cherry picked from commit cb3d7b3b46)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Current mem limit check leaks some GTT memory (reserved_for_pt
reserved_for_ras + adev->vram_pin_size) for small APUs.
Since carveout VRAM is tunable on APUs, there are three case
regarding the carveout VRAM size relative to GTT:
1. 0 < carveout < gtt
apu_prefer_gtt = true, is_app_apu = false
2. carveout > gtt / 2
apu_prefer_gtt = false, is_app_apu = false
3. 0 = carveout
apu_prefer_gtt = true, is_app_apu = true
It doesn't make sense to check below limitation in case 1
(default case, small carveout) because the values in the below
expression are mixed with carveout and gtt.
adev->kfd.vram_used[xcp_id] + vram_needed >
vram_size - reserved_for_pt - reserved_for_ras -
atomic64_read(&adev->vram_pin_size)
gtt: kfd.vram_used, vram_needed, vram_size
carveout: reserved_for_pt, reserved_for_ras, adev->vram_pin_size
In case 1, vram allocation will go to gtt domain, skip vram check
since ttm_mem_limit check already cover this allocation.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fa7c99f04f)
When creating p2p links, KFD needs to check XGMI link
with two conditions, hive_id and is_sharing_enabled,
but it is missing to check is_sharing_enabled, so add
it to fix the error.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 36cc7d1317)
Some PSPv11 SOCs take a longer time for PSP based mode-1 reset. Instead
of checking for C2PMSG_33 status, add the callback wait_for_bootloader.
Wait for bootloader to be back to steady state is already part of the
generic mode-1 reset flow. Increase the retry count for bootloader wait
and also fix the mask to prevent fake pass.
Fixes: 8345a71fc5 ("drm/amdgpu: Add more checks to PSP mailbox")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4531
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 32f73741d6)
This reverts:
commit bead880022 ("drm/nouveau: Remove waitque for sched teardown")
commit 5f46f5c7af ("drm/nouveau: Add new callback for scheduler teardown")
from the drm/sched teardown leak fix series:
https://lore.kernel.org/dri-devel/20250710125412.128476-2-phasta@kernel.org/
The aforementioned series removed a blocking waitqueue from
nouveau_sched_fini(). It was mistakenly assumed that this waitqueue only
prevents jobs from leaking, which the series fixed.
The waitqueue, however, also guarantees that all VM_BIND related jobs
are finished in order, cleaning up mappings in the GPU's MMU. These jobs
must be executed sequentially. Without the waitqueue, this is no longer
guaranteed, because entity and scheduler teardown can race with each
other.
Revert all patches related to the waitqueue removal.
Fixes: bead880022 ("drm/nouveau: Remove waitque for sched teardown")
Suggested-by: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250901083107.10206-2-phasta@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Currently the kzalloc failure check just sets reports the failure
and sets the variable ret to -ENOMEM, which is not checked later
for this specific error. Fix this by just returning -ENOMEM rather
than setting ret.
Fixes: 4fb9307154 ("drm/amd/amdgpu: remove redundant host to psp cmd buf allocations")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1ee9d1a096)
The bridge has three bootstrap pins which are sampled to determine the
frequency of the external reference clock. The driver will also
(over)write that setting. But it seems this is racy after the bridge is
enabled. It was observed that although the driver write the correct
value (by sniffing on the I2C bus), the register has the wrong value.
The datasheet states that the GPIO lines have to be stable for at least
5us after asserting the EN signal. Thus, there seems to be some logic
which samples the GPIO lines and this logic appears to overwrite the
register value which was set by the driver. Waiting 20us after
asserting the EN line resolves this issue.
Fixes: a095f15c00 ("drm/bridge: add support for sn65dsi86 bridge driver")
Signed-off-by: Michael Walle <mwalle@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20250821122341.1257286-1-mwalle@kernel.org
The drm_sched_job_unschedulable trace point can access
entity->dependency after it was cleared by the callback
installed in drm_sched_entity_add_dependency_cb, causing:
BUG: kernel NULL pointer dereference, address: 0000000000000020
[...]
Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
RIP: 0010:trace_event_raw_event_drm_sched_job_unschedulable+0x70/0xd0 [gpu_sched]
To fix this we either need to keep a reference to the fence before
setting up the callbacks, or move the trace_drm_sched_job_unschedulable
calls into drm_sched_entity_add_dependency_cb where they can be
done earlier.
Fixes: 76d97c870f ("drm/sched: Trace dependencies for GPU jobs")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250901124032.1955-1-pierre-eric.pelloux-prayer@amd.com
(cherry picked from commit b2b8af21fe)
Signed-off-by: Maxime Ripard <mripard@kernel.org>
If built on architectures with CONFIG_ARCH_DMA_ADDR_T_64BIT=y nova-core
produces that following build failures:
error[E0308]: mismatched types
--> drivers/gpu/nova-core/fb.rs:49:59
|
49 | hal::fb_hal(chipset).write_sysmem_flush_page(bar, page.dma_handle())?;
| ----------------------- ^^^^^^^^^^^^^^^^^ expected `u64`, found `u32`
| |
| arguments to this method are incorrect
|
note: method defined here
--> drivers/gpu/nova-core/fb/hal.rs:19:8
|
19 | fn write_sysmem_flush_page(&self, bar: &Bar0, addr: u64) -> Result;
| ^^^^^^^^^^^^^^^^^^^^^^^
help: you can convert a `u32` to a `u64`
|
49 | hal::fb_hal(chipset).write_sysmem_flush_page(bar, page.dma_handle().into())?;
| +++++++
error[E0308]: mismatched types
--> drivers/gpu/nova-core/fb.rs:65:47
|
65 | if hal.read_sysmem_flush_page(bar) == self.page.dma_handle() {
| ------------------------------- ^^^^^^^^^^^^^^^^^^^^^^ expected `u64`, found `u32`
| |
| expected because this is `u64`
|
help: you can convert a `u32` to a `u64`
|
65 | if hal.read_sysmem_flush_page(bar) == self.page.dma_handle().into() {
| +++++++
error: this arithmetic operation will overflow
--> drivers/gpu/nova-core/falcon.rs:469:23
|
469 | .set_base((dma_start >> 40) as u16)
| ^^^^^^^^^^^^^^^^^ attempt to shift right by `40_i32`, which would overflow
|
= note: `#[deny(arithmetic_overflow)]` on by default
This is due to the code making assumptions on the width of dma_addr_t to
be 64 bit.
While this could technically be handled, it is rather painful to deal
with, as the following example illustrates:
pub(super) fn read_sysmem_flush_page_ga100(bar: &Bar0) -> DmaAddress {
let addr = u64::from(regs::NV_PFB_NISO_FLUSH_SYSMEM_ADDR::read(bar).adr_39_08())
<< FLUSH_SYSMEM_ADDR_SHIFT
| u64::from(regs::NV_PFB_NISO_FLUSH_SYSMEM_ADDR_HI::read(bar).adr_63_40())
<< FLUSH_SYSMEM_ADDR_SHIFT_HI;
addr.try_into().unwrap_or_else(|_| {
kernel::warn_on!(true);
0
})
}
At the same time there's not much value for nova-core to support 32-bit,
given that the supported GPU architectures are Turing and later, hence
depend on CONFIG_64BIT.
Cc: John Hubbard <jhubbard@nvidia.com>
Reported-by: Miguel Ojeda <ojeda@kernel.org>
Closes: https://lore.kernel.org/lkml/20250828160247.37492-1-ojeda@kernel.org/
Fixes: 6554ad65b5 ("gpu: nova-core: register sysmem flush page")
Fixes: 69f5cd67ce ("gpu: nova-core: add falcon register definitions and base code")
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: https://lore.kernel.org/r/20250828223954.351348-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Nouveau has code that when it gets an IRQ with no allowed handler
it disables it to avoid storms.
However with nonstall interrupts, we often disable them from
the drm driver, but still request their emission via the push submission.
Just don't disable nonstall irqs ever in normal operation, the
event handling code will filter them out, and the driver will
just enable/disable them at load time.
This fixes timeouts we've been seeing on/off for a long time,
but they became a lot more noticeable on Blackwell.
This doesn't fix all of them, there is a subsequent fence emission
fix to fix the last few.
Fixes: 3ebd64aa3c ("drm/nouveau/intr: support multiple trees, and explicit interfaces")
Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://lore.kernel.org/r/20250829021633.1674524-1-airlied@gmail.com
[ Fix a typo and a minor checkpatch.pl warning; remove "v2" from commit
subject. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Using the previous firmware could lead to problems with
PROTECTED_FENCE_SIGNAL commands, specifically causing register
conflicts between MCU_DBG0 and MCU_DBG1.
The updated firmware versions ensure proper alignment
and unification of the SDMA_SUBOP_PROTECTED_FENCE_SIGNAL value with SDMA 7.x,
resolving these hardware coordination issues
Fixes: e8cca30d8b ("drm/amdgpu/sdma6: add ucode version checks for userq support")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit aab8b689ad)
Cc: stable@vger.kernel.org
Mediatek DRM Fixes - 20250829
1. Add error handling for old state CRTC in atomic_disable
2. Fix DSI host and panel bridge pre-enable order
3. Fix device/node reference count leaks in mtk_drm_get_all_drm_priv
4. mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Link: https://lore.kernel.org/r/20250828234116.4960-1-chunkuang.hu@kernel.org
The minimum firmware versions required for user queue functionality
have been increased to address an issue where the queue privilege
state was lost during queue connect operations.
The problem occurred because the privilege state was being restored
to its initial value at the beginning of the function, overwriting
the state that was properly set during the queue connect case.
This commit updates the minimum version requirements:
- ME firmware from 2390 to 2420
- PFP firmware from 2530 to 2580
- MEC firmware from 2600 to 2650
- MES firmware remains at 120
These updated firmware versions contain the necessary fixes to
properly maintain queue privilege state throughout connect operations.
Fixes: 61ca97e959 ("drm/amdgpu: Add fw minimum version check for usermode queue")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5f976c9939)
Cc: stable@vger.kernel.org