Commit Graph

1170202 Commits

Author SHA1 Message Date
Oded Gabbay
60d7bbb5b4 accel/habanalabs: fix field names in hl_info_hw_ip_info
Don't use padX for actual reservedX fields.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-03-20 17:35:30 +02:00
Dafna Hirschfeld
5d8a5f2965 accel/habanalabs: in {e/p}dma_core events read the err cause reg
Since the err_cause register is unprivileged, we should read it from
the driver instead of using the param that came from the FW.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:30 +02:00
Dafna Hirschfeld
f8d139a71b accel/habanalabs: fix use of var reset_sleep_ms
- remove reset_sleep_ms arg from functions that don't use it.
- move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini
as it is called from there already for other flow.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:30 +02:00
Dafna Hirschfeld
077a39fabe accel/habanalabs: in hw_fini return error code if polling timed-out
In hw_fini callback, we use either the cpucp packet method or polling a
register. Currently we return error only in the case of cpucp packet
failure. In this patch we also return error if polling timed out.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:29 +02:00
Ofir Bitton
8c695455ee accel/habanalabs: increase reset poll timeout
Due to a firmware bug we need to increase reset poll timeout
or else we will timeout in secured environments.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:29 +02:00
Koby Elbaz
7c766e58cc accel/habanalabs: do not verify engine modes after being changed
Engines idle state can't always be verified between changes of
engine modes (e.g., stall/halt).
For example, if a CS is inflight when altering engine's mode,
idle state will return NOT idle, always.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:29 +02:00
Oded Gabbay
336b78c655 accel/habanalabs: align to latest firmware specs
Copy the most up-to-date interface files to the firmware.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-03-20 17:35:25 +02:00
Colin Ian King
443355d2ff accel/habanalabs: Fix spelling mistake "maped" -> "mapped"
There is a spelling mistake in a dev_err message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 17:35:25 +02:00
Oded Gabbay
79c164372e accel/habanalabs: make gaudi2_is_device_idle() static
This function is only called inside gaudi2.c file.

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202303071320.X5ouBlNY-lkp@intel.com/
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-03-20 17:31:02 +02:00
Dave Airlie
90031bc33f Merge tag 'amd-drm-next-6.4-2023-03-17' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.4-2023-03-17:

amdgpu:
- Misc code cleanups
- Documentation fixes
- Make kobj structures const
- Add thermal throttling adjustments for supported APUs
- UMC RAS fixes
- Display reset fixes
- DCN 3.2 fixes
- Freesync fixes
- DC code reorg
- Generalize dmabuf import to work with KFD
- DC DML fixes
- SRIOV fixes
- UVD code cleanups
- IH 4.4.2 updates
- HDP 4.4.2 updates
- SDMA 4.4.2 updates
- PSP 13.0.6 updates
- Add capped/uncapped workload handling for supported APUs
- DCN 3.1.4 updates
- Re-org DC Kconfig
- USB4 fixes
- Reorg DC plane and stream handling
- Register vga_switcheroo for apple-gmux
- SMU 13.0.6 updates
- Fix error checking in read_mm_registers functions for affected families
- VCN 4.0.4 fix
- Drop redundant pci_enable_pcie_error_reporting() call
- RDNA2 SMU OD suspend/resume fix
- Expose additional memory stats via fdinfo
- RAS fixes
- Misc display fixes
- DP MST fixes
- IOMMU regression fix for KFD

amdkfd:
- Make kobj structures const
- Support for exporting buffers via dmabuf
- Multi-VMA page migration fixes
- NBIO fixes
- Misc code cleanups
- Fix possible double free
- Fix possible UAF

radeon:
- iMac fix

UAPI:
- KFD dmabuf export support.  Required for importing KFD buffers into GEM contexts and for RDMA P2P support.
  Proposed user mode changes: https://github.com/fxkamd/ROCT-Thunk-Interface/commits/fxkamd/dmabuf

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230317164416.138340-1-alexander.deucher@amd.com
2023-03-20 16:44:36 +10:00
Tom Rix
b24343eace drm/nouveau/nvfw/acr: set wpr_generic_header_dump storage-class-specifier to static
gcc with W=1 reports
drivers/gpu/drm/nouveau/nvkm/nvfw/acr.c:49:1: error: no previous
  prototype for ‘wpr_generic_header_dump’ [-Werror=missing-prototypes]
   49 | wpr_generic_header_dump(struct nvkm_subdev *subdev,
      | ^~~~~~~~~~~~~~~~~~~~~~~

wpr_generic_header_dump is only used in acr.c, so it should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230302124819.686469-1-trix@redhat.com
2023-03-16 14:53:15 +01:00
Tom Rix
c14bff92ab drm/nouveau/fifo: set nvkm_engn_cgrp_get storage-class-specifier to static
smatch reports
drivers/gpu/drm/nouveau/nvkm/engine/fifo/runl.c:33:18:
  warning: symbol 'nvkm_engn_cgrp_get' was not declared. Should it be static?

nvkm_engn_cgrp_get is only used in runl.c, so it should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230228221533.3240520-1-trix@redhat.com
2023-03-16 14:53:15 +01:00
Tom Rix
abe3c66f34 drm/nouveau/fifo: set gf100_fifo_nonstall_block_dump storage-class-specifier to static
gcc with W=1 reports
drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c:451:1: error:
  no previous prototype for ‘gf100_fifo_nonstall_block’ [-Werror=missing-prototypes]
  451 | gf100_fifo_nonstall_block(struct nvkm_event *event, int type, int index)
      | ^~~~~~~~~~~~~~~~~~~~~~~~~

gf100_fifo_nonstall_block is only used in gf100.c, so it should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230303132731.1919329-1-trix@redhat.com
2023-03-16 14:53:15 +01:00
Felix Kuehling
7ee938ac00 drm/amdgpu: Don't resume IOMMU after incomplete init
Check kfd->init_complete in kgd2kfd_iommu_resume, consistent with other
kgd2kfd calls. This should fix IOMMU errors on resume from suspend when
KFD IOMMU initialization failed.

Reported-by: Matt Fagnani <matt.fagnani@bell.net>
Link: https://lore.kernel.org/r/4a3b225c-2ffd-e758-4de1-447375e34cad@bell.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217170
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2454
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info>
Cc: stable@vger.kernel.org
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: Matt Fagnani <matt.fagnani@bell.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
bobzhou
0bad3200df drm/amd: fix compilation issue with legacy gcc
This patch is used to fix following compilation issue with legacy gcc

error: ‘for’ loop initial declarations are only allowed in C99 mode

Signed-off-by: bobzhou <bob.zhou@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
4489f0fd9e drm/amdgpu: Retire pcie_gen3_enable function
Not needed since from vi. drop the function so
we don't duplicate code when introduce new asics.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
dabc114e4b drm/amdgpu: Move to common helper to query soc rev_id
Replace soc15, nv, soc21 get_rev_id callback with common
helper so we don't need to duplicate code when introduce
new asics.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
65ba96e91b drm/amdgpu: Move to common indirect reg access helper
Replace soc15, nv, soc21 specific callbacks with common
one. so we don't need to duplicate code when introduce
new asics.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
370808876b drm/amdgpu: drop ras check at asic level for new blocks
amdgpu_ras_register_ras_block should always be invoked
by ras_sw_init, where driver needs to check ras caps
at ip level, instead of asic level.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
fdc94d3a8c drm/amdgpu: Rework pcie_bif ras sw_init
pcie_bif ras blocks needs to be initialized as early
as possible to handle fatal error detected in hw_init
phase. also align the pcie_bif ras sw_init with other
ras blocks

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
da9d669eab drm/amdgpu: Rework xgmi_wafl_pcs ras sw_init
To align with other IP blocks.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
7f544c5488 drm/amdgpu: Rework mca ras sw_init
To align with other IP blocks

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
David Belanger
22e3d9343b drm/amdkfd: Fixed kfd_process cleanup on module exit.
Handle case when module is unloaded (kfd_exit) before a process space
(mm_struct) is released.

v2: Fixed potential race conditions by removing all kfd_process from
the process table first, then working on releasing the resources.

v3: Fixed loop element access / synchronization.  Fixed extra empty lines.

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Aric Cyr
3726b6e7c0 drm/amd/display: 3.2.227
This version brings along the following:
- FW Release 0.0.158.0
- Fixes to HDCP, DP MST and more
- Improvements on USB4 links and more
- Code re-architecture on link.h

Reviewed-by: Aric Cyr <aric.cyr@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Aric Cyr <aric.cyr@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Anthony Koo
c79503dc2e drm/amd/display: [FW Promotion] Release 0.0.158.0
[Why & How]
Add boot control bit to control dispclk and dppclk deep sleep

Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Anthony Koo <Anthony.Koo@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Samson Tam
05cff51055 drm/amd/display: fix assert condition
[Why & How]
Reversed assert condition when checking that phy_pix_clk[] is not 0

Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Samson Tam <Samson.Tam@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Stylon Wang
c416a9e4e3 drm/amd/display: Clearly states if long or short HPD event in dmesg logs
[Why]
The log "DMUB HPD callback" is crucial to identify when DP tunneling
is been established and driver is notified of this event from DMUB.
Same log is shared for long and short hotplug event and we need to
check trailing DC debug log to distinguish between them two, making
debugging on DPIA related issues a bit more troublesome.

[How]
Clearly states in dmesg logs whether this is a long or short hotplug
event.

Reviewed-by: Hamza Mahfooz <Hamza.Mahfooz@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Stylon Wang <stylon.wang@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Wesley Chalmers
7108a1c127 drm/amd/display: Make DCN32 functions available to future DCNs
[Why & How]
Make DCN32 functions available for more DCNs.

Reviewed-by: Chris Park <Chris.Park@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Wesley Chalmers <Wesley.Chalmers@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Yifan Zha
057e335c71 drm/amdgpu: Init MMVM_CONTEXTS_DISABLE in gmc11 golden setting under SRIOV
[Why]
If disable the mmhub vm contexts(set MMVM_CONTEXTS_DISABLE to 0xffff),
driver loading failed on vf due to fence fallback timer expired on all rings.
FLR cannot reset MMVM_CONTEXTS_DISABLE.
So this vf can not be recovered anymore unless trigger a whole gpu reset.

[How]
Under SRIOV, init MMVM_CONTEXTS_DISABLE in gmc11 golden register setting.

Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Reviewed-by: Horace Chen <Horace.Chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Samson Tam
5f3401eeb0 drm/amd/display: reallocate DET for dual displays with high pixel rate ratio
[Why]
For dual displays where pixel rate is much higher on one display,
we may get underflow when DET is evenly allocated.

[How]
Allocate less DET segments for the lower pixel rate display and
more DET segments for the higher pixel rate display

Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Samson Tam <Samson.Tam@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Ayush Gupta
562e08223a drm/amd/display: disconnect MPCC only on OTG change
[Why]
Framedrops are observed while playing Vp9 and Av1 10 bit
video on 8k resolution using VSR while playback controls
are disappeared/appeared

[How]
Now ODM 2 to 1 is disabled for 5k or greater resolutions on VSR.

Cc: stable@vger.kernel.org
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Ayush Gupta <ayugupta@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Cruise Hung
deaccddaf4 drm/amd/display: Fix DP MST sinks removal issue
[Why]
In USB4 DP tunneling, it's possible to have this scenario that
the path becomes unavailable and CM tears down the path a little bit late.
So, in this case, the HPD is high but fails to read any DPCD register.
That causes the link connection type to be set to sst.
And not all sinks are removed behind the MST branch.

[How]
Restore the link connection type if it fails to read DPCD register.

Cc: stable@vger.kernel.org
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Cruise Hung <Cruise.Hung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:26 -04:00
Robin Chen
eeefe7c482 drm/amd/display: hpd rx irq not working with eDP interface
[Why]
This is the fix for the defect of commit ab144f0b4a
("drm/amd/display: Allow individual control of eDP hotplug support").

[How]
To revise the default eDP hotplug setting and use the enum to git rid
of the magic number for different options.

Fixes: ab144f0b4a ("drm/amd/display: Allow individual control of eDP hotplug support")
Cc: stable@vger.kernel.org
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Robin Chen <robin.chen@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:13 -04:00
Zack Rusin
328839ff93 drm/vmwgfx: Fix src/dst_pitch confusion
The src/dst_pitch got mixed up during the rework of the function, make
sure the offset's refer to the correct one.

Spotted by clang:
Clang warns (or errors with CONFIG_WERROR):

  drivers/gpu/drm/vmwgfx/vmwgfx_stdu.c:509:29: error: variable 'dst_pitch' is uninitialized when used here [-Werror,-Wuninitialized]
          src_offset = ddirty->top * dst_pitch + ddirty->left * stdu->cpp;
                                     ^~~~~~~~~
  drivers/gpu/drm/vmwgfx/vmwgfx_stdu.c:492:26: note: initialize the variable 'dst_pitch' to silence this warning
          s32 src_pitch, dst_pitch;
                                  ^
                                   = 0
  1 error generated.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1811
Fixes: 39985eea5a ("drm/vmwgfx: Abstract placement selection")
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230314211445.1363828-1-zack@kde.org
2023-03-15 16:15:25 -04:00
Tvrtko Ursulin
4230cea89c drm: Track clients by tgid and not tid
Thread group id (aka pid from userspace point of view) is a more
interesting thing to show as an owner of a DRM fd, so track and show that
instead of the thread id.

In the next patch we will make the owner updated post file descriptor
handover, which will also be tgid based to avoid ping-pong when multiple
threads access the fd.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230314141904.1210824-2-tvrtko.ursulin@linux.intel.com
2023-03-15 14:03:00 +01:00
Bjorn Helgaas
4516fede7c accel/habanalabs: Drop redundant pci_enable_pcie_error_reporting()
pci_enable_pcie_error_reporting() enables the device to send ERR_*
Messages. Since
commit f26e58bf6f ("PCI/AER: Enable error reporting when AER is native"),
the PCI core does this for all devices during enumeration, so the
driver doesn't need to do it itself.

Remove the redundant pci_enable_pcie_error_reporting() call from the
driver.  Also remove the corresponding pci_disable_pcie_error_reporting()
from the driver .remove() path.

Note that this only controls ERR_* Messages from the device.  An ERR_*
Message may cause the Root Port to generate an interrupt, depending on the
AER Root Error Command register managed by the AER service driver.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:16 +02:00
Tomer Tayar
2e8e9a895c accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()
The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Dafna Hirschfeld
801507d3b3 accel/habanalabs: move soft-reset wait to soft-reset execute
We plan to do soft-reset either by mmio or by using cpucp packet
depending on the FW version. We don't want to check FW version in two
different places for that (execute soft-reset and wait to soft-reset)
so move the waiting to gaudi2_execute_soft_reset. This also makes sense
because the cpucp also does the waiting.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
f7f0085eec accel/habanalabs: add uapi to stall/resume engine
The user might want to stall/resume engines to perform power testing
for various scenarios. Because our current
HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores,
we need to add another opcode for handling entire engine and not just
its core.

The user supplies an array, where each entry holds the engine's ID and
the command to send to the engine. The size of the array is limited
by the number of engines in the ASIC (only Gaudi2 is currently
supported).

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Tomer Tayar
28fbc058f2 accel/habanalabs: use scnprintf() in print_device_in_use_info()
compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_device_in_use_info() is set such that
it would be enough for the max possible print, so
compose_device_in_use_info() is not really needed.
Moreover, scnprintf() can be used instead of snprintf(), to save the
check if the return value larger than the given size.

Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:15 +02:00
Dafna Hirschfeld
087fe7c9c2 accel/habanalabs: unify err log of hw-fini failure in dirty state
print more informative message when failing in dirty state

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-03-15 13:29:15 +02:00
Koby Elbaz
7ffb5ced2b accel/habanalabs: use a mutex rather than a spinlock
There are two reasons why mutex is better here:
1. There's a critical section relatively long, where in
certain scenarios (e.g., multiple VM allocations) taking a spinlock
might cause noticeable performance degradation.
2. It will remove the incorrect usage of mutex under
spin_lock (where preemption is disabled).

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Dafna Hirschfeld
c7ac65c881 accel/habanalabs: allow getting HL_INFO_DRAM_USAGE during soft-reset
We can allow userspace to query the dram usage during soft-reset.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
9732d5d0d4 accel/habanalabs: fix register address on PDMA/EDMA idle check
The PDMA/EDMA is_idle routines didn't check the correct CORE register
in order to get the accurate idle state.
Moreover, it's better to make the is_idle routine more robust by adding
additional checks (IS_HALTED) before announcing that the core is idle.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
f831aade13 accel/habanalabs: remove a useless is_idle TPC flag
Is appears that the flag -
DCORE0_TPC0_CFG_STATUS_VECTOR_PIPE_EMPTY_MASK, has no actual use when
it comes to querying TPC idleness, since this flag's corresponding bit
turns-off after stalling the engine, and turns back on after resuming
it.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
farah kassabri
25ebbc57ca accel/habanalabs: fix few misspelled words in the code
Run spell checker on the code and fix accordingly.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Koby Elbaz
75276e23d1 accel/habanalabs: verify return code after scrubbing ARCs DCCMs
In case the KDMA fails scrubbing the DCCMs (following a soft-reset
upon device release), the driver will only print failure until reset
flow ends, rather than escalating it into a hard-reset.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:15 +02:00
Tomer Tayar
81e609bebb accel/habanalabs: use notifications and graceful reset for decoder
Add notifications to user in case of decoder abnormal interrupts, and
use the graceful reset mechanism if reset is required.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Dafna Hirschfeld
86b74d8438 accel/habanalabs: assert return value of hw_fini
Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. Later patch will add hw_fini failure in case of
polling timeout in hard-reset.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00
Koby Elbaz
d85f0531b9 accel/habanalabs: break is_idle function into per-engine sub-routines
is_idle() was too long, so break it up for readability.
In addition, we can now use the new sub-routines from other places.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-15 13:29:14 +02:00