Since the err_cause register is unprivileged, we should read it from
the driver instead of using the param that came from the FW.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
- remove reset_sleep_ms arg from functions that don't use it.
- move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini
as it is called from there already for other flow.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In hw_fini callback, we use either the cpucp packet method or polling a
register. Currently we return error only in the case of cpucp packet
failure. In this patch we also return error if polling timed out.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Due to a firmware bug we need to increase reset poll timeout
or else we will timeout in secured environments.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Engines idle state can't always be verified between changes of
engine modes (e.g., stall/halt).
For example, if a CS is inflight when altering engine's mode,
idle state will return NOT idle, always.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
gcc with W=1 reports
drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c:451:1: error:
no previous prototype for ‘gf100_fifo_nonstall_block’ [-Werror=missing-prototypes]
451 | gf100_fifo_nonstall_block(struct nvkm_event *event, int type, int index)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
gf100_fifo_nonstall_block is only used in gf100.c, so it should be static
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230303132731.1919329-1-trix@redhat.com
This patch is used to fix following compilation issue with legacy gcc
error: ‘for’ loop initial declarations are only allowed in C99 mode
Signed-off-by: bobzhou <bob.zhou@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Replace soc15, nv, soc21 get_rev_id callback with common
helper so we don't need to duplicate code when introduce
new asics.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_ras_register_ras_block should always be invoked
by ras_sw_init, where driver needs to check ras caps
at ip level, instead of asic level.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
pcie_bif ras blocks needs to be initialized as early
as possible to handle fatal error detected in hw_init
phase. also align the pcie_bif ras sw_init with other
ras blocks
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Handle case when module is unloaded (kfd_exit) before a process space
(mm_struct) is released.
v2: Fixed potential race conditions by removing all kfd_process from
the process table first, then working on releasing the resources.
v3: Fixed loop element access / synchronization. Fixed extra empty lines.
Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This version brings along the following:
- FW Release 0.0.158.0
- Fixes to HDCP, DP MST and more
- Improvements on USB4 links and more
- Code re-architecture on link.h
Reviewed-by: Aric Cyr <aric.cyr@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Aric Cyr <aric.cyr@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
The log "DMUB HPD callback" is crucial to identify when DP tunneling
is been established and driver is notified of this event from DMUB.
Same log is shared for long and short hotplug event and we need to
check trailing DC debug log to distinguish between them two, making
debugging on DPIA related issues a bit more troublesome.
[How]
Clearly states in dmesg logs whether this is a long or short hotplug
event.
Reviewed-by: Hamza Mahfooz <Hamza.Mahfooz@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Stylon Wang <stylon.wang@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
If disable the mmhub vm contexts(set MMVM_CONTEXTS_DISABLE to 0xffff),
driver loading failed on vf due to fence fallback timer expired on all rings.
FLR cannot reset MMVM_CONTEXTS_DISABLE.
So this vf can not be recovered anymore unless trigger a whole gpu reset.
[How]
Under SRIOV, init MMVM_CONTEXTS_DISABLE in gmc11 golden register setting.
Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Reviewed-by: Horace Chen <Horace.Chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
For dual displays where pixel rate is much higher on one display,
we may get underflow when DET is evenly allocated.
[How]
Allocate less DET segments for the lower pixel rate display and
more DET segments for the higher pixel rate display
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Samson Tam <Samson.Tam@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
In USB4 DP tunneling, it's possible to have this scenario that
the path becomes unavailable and CM tears down the path a little bit late.
So, in this case, the HPD is high but fails to read any DPCD register.
That causes the link connection type to be set to sst.
And not all sinks are removed behind the MST branch.
[How]
Restore the link connection type if it fails to read DPCD register.
Cc: stable@vger.kernel.org
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Cruise Hung <Cruise.Hung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The src/dst_pitch got mixed up during the rework of the function, make
sure the offset's refer to the correct one.
Spotted by clang:
Clang warns (or errors with CONFIG_WERROR):
drivers/gpu/drm/vmwgfx/vmwgfx_stdu.c:509:29: error: variable 'dst_pitch' is uninitialized when used here [-Werror,-Wuninitialized]
src_offset = ddirty->top * dst_pitch + ddirty->left * stdu->cpp;
^~~~~~~~~
drivers/gpu/drm/vmwgfx/vmwgfx_stdu.c:492:26: note: initialize the variable 'dst_pitch' to silence this warning
s32 src_pitch, dst_pitch;
^
= 0
1 error generated.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1811
Fixes: 39985eea5a ("drm/vmwgfx: Abstract placement selection")
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230314211445.1363828-1-zack@kde.org
pci_enable_pcie_error_reporting() enables the device to send ERR_*
Messages. Since
commit f26e58bf6f ("PCI/AER: Enable error reporting when AER is native"),
the PCI core does this for all devices during enumeration, so the
driver doesn't need to do it itself.
Remove the redundant pci_enable_pcie_error_reporting() call from the
driver. Also remove the corresponding pci_disable_pcie_error_reporting()
from the driver .remove() path.
Note that this only controls ERR_* Messages from the device. An ERR_*
Message may cause the Root Port to generate an interrupt, depending on the
AER Root Error Command register managed by the AER service driver.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
We plan to do soft-reset either by mmio or by using cpucp packet
depending on the FW version. We don't want to check FW version in two
different places for that (execute soft-reset and wait to soft-reset)
so move the waiting to gaudi2_execute_soft_reset. This also makes sense
because the cpucp also does the waiting.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The user might want to stall/resume engines to perform power testing
for various scenarios. Because our current
HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores,
we need to add another opcode for handling entire engine and not just
its core.
The user supplies an array, where each entry holds the engine's ID and
the command to send to the engine. The size of the array is limited
by the number of engines in the ASIC (only Gaudi2 is currently
supported).
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_device_in_use_info() is set such that
it would be enough for the max possible print, so
compose_device_in_use_info() is not really needed.
Moreover, scnprintf() can be used instead of snprintf(), to save the
check if the return value larger than the given size.
Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
There are two reasons why mutex is better here:
1. There's a critical section relatively long, where in
certain scenarios (e.g., multiple VM allocations) taking a spinlock
might cause noticeable performance degradation.
2. It will remove the incorrect usage of mutex under
spin_lock (where preemption is disabled).
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The PDMA/EDMA is_idle routines didn't check the correct CORE register
in order to get the accurate idle state.
Moreover, it's better to make the is_idle routine more robust by adding
additional checks (IS_HALTED) before announcing that the core is idle.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Is appears that the flag -
DCORE0_TPC0_CFG_STATUS_VECTOR_PIPE_EMPTY_MASK, has no actual use when
it comes to querying TPC idleness, since this flag's corresponding bit
turns-off after stalling the engine, and turns back on after resuming
it.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In case the KDMA fails scrubbing the DCCMs (following a soft-reset
upon device release), the driver will only print failure until reset
flow ends, rather than escalating it into a hard-reset.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. Later patch will add hw_fini failure in case of
polling timeout in hard-reset.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
is_idle() was too long, so break it up for readability.
In addition, we can now use the new sub-routines from other places.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>