Fix deadlock in job submission and abort handling.
When a thread aborts currently executing jobs due to a fault,
it first locks the global lock protecting submitted_jobs (#1).
After the last job is destroyed, it proceeds to release the related context
and locks file_priv (#2). Meanwhile, in the job submission thread,
the file_priv lock (#2) is taken first, and then the submitted_jobs
lock (#1) is obtained when a job is added to the submitted jobs list.
CPU0 CPU1
---- ----
(for example due to a fault) (jobs submissions keep coming)
lock(&vdev->submitted_jobs_lock) #1
ivpu_jobs_abort_all()
job_destroy()
lock(&file_priv->lock) #2
lock(&vdev->submitted_jobs_lock) #1
file_priv_release()
lock(&vdev->context_list_lock)
lock(&file_priv->lock) #2
This order of locking causes a deadlock. To resolve this issue,
change the order of locking in ivpu_job_submit().
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-12-maciej.falkowski@linux.intel.com
To prevent looping infinitely in MMU event handler we stop
generating new events by removing 'R' (record) bit from context
descriptor, but to ensure this change has effect KMD has to perform
configuration invalidation followed by sync command.
Because of that move parts of the interrupt handler that can take longer
to a thread not to block in interrupt handler for too long.
This includes:
* disabling event queue for the time KMD updates MMU event queue consumer
to ensure proper synchronization between MMU and KMD
* removal of 'R' (record) bit from context descriptor to ensure no more
faults are recorded until that context is destroyed
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-8-maciej.falkowski@linux.intel.com
Stop dumping consecutive faults from an already faulty context immediately,
instead of waiting for the context abort thread handler (IRQ handler bottom
half) to abort currently executing jobs.
Remove 'R' (record events) bit from context descriptor of a faulty
context to prevent future faults generation.
This change speeds up the IRQ handler by eliminating the need to print the
fault content repeatedly. Additionally, it prevents flooding dmesg with
errors, which was occurring due to the delay in the bottom half of the
handler stopping fault-generating jobs.
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-7-maciej.falkowski@linux.intel.com
With hardware scheduler it is not expected to receive JOB_DONE
notifications from NPU FW for the jobs aborted due to command queue destroy
JSM command.
Remove jobs submitted to unregistered command queue from submitted_jobs_xa
to avoid triggering a TDR in such case.
Add explicit submitted_jobs_lock that protects access to list of submitted
jobs which is now used to find jobs to abort.
Move context abort procedure to separate work queue not to slow down
handling of IPCs or DCT requests in case where job abort takes longer,
especially when destruction of the last job of a specific context results
in context release.
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-4-maciej.falkowski@linux.intel.com
This driver fails to build in random configurations:
drivers/accel/amdxdna/aie2_solver.c: In function 'remove_partition_node':
drivers/accel/amdxdna/aie2_solver.c:121:9: error: implicit declaration of function 'kfree' [-Wimplicit-function-declaration]
121 | kfree(pt_node);
| ^~~~~
drivers/accel/amdxdna/aie2_solver.c: In function 'get_free_partition':
drivers/accel/amdxdna/aie2_solver.c:153:19: error: implicit declaration of function 'kzalloc' [-Wimplicit-function-declaration]
153 | pt_node = kzalloc(sizeof(*pt_node), GFP_KERNEL);
Add the missing include.
Fixes: c88d3325ae ("accel/amdxdna: Add hardware resource solver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241218085902.2684002-1-arnd@kernel.org
Add SET_STATE ioctl to configure device power mode for aie2 device.
Three modes are supported initially.
POWER_MODE_DEFAULT: Enable clock gating and set DPM (Dynamic Power
Management) level to value which has been set by resource solver or
maximum DPM level the device supports.
POWER_MODE_HIGH: Enable clock gating and set DPM level to maximum DPM
level the device supports.
POWER_MODE_TURBO: Disable clock gating and set DPM level to maximum DPM
level the device supports.
Disabling clock gating means all clocks always run on full speed. And
the different clock frequency are used based on DPM level been set.
Initially, the driver set the power mode to default mode.
Co-developed-by: Narendra Gutta <VenkataNarendraKumar.Gutta@amd.com>
Signed-off-by: Narendra Gutta <VenkataNarendraKumar.Gutta@amd.com>
Co-developed-by: George Yang <George.Yang@amd.com>
Signed-off-by: George Yang <George.Yang@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241213232933.1545388-4-lizhi.hou@amd.com
The old SET_SYSTEM_SLEEP_PM_OPS and SET_RUNTIME_PM_OPS macros cause a build
warning when CONFIG_PM is disabled:
drivers/accel/amdxdna/amdxdna_pci_drv.c:343:12: error: 'amdxdna_pmops_resume' defined but not used [-Werror=unused-function]
343 | static int amdxdna_pmops_resume(struct device *dev)
| ^~~~~~~~~~~~~~~~~~~~
drivers/accel/amdxdna/amdxdna_pci_drv.c:328:12: error: 'amdxdna_pmops_suspend' defined but not used [-Werror=unused-function]
328 | static int amdxdna_pmops_suspend(struct device *dev)
| ^~~~~~~~~~~~~~~~~~~~~
Change these to the modern replacements.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Tested-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241213090259.68492-1-arnd@kernel.org
The v6.13-rc2 release included a bunch of breaking changes,
specifically the MODULE_IMPORT_NS commit.
Backmerge in order to fix them before the next pull-request.
Include the fix from Stephen Roswell.
Caused by commit
25c3fd1183 ("drm/virtio: Add a helper to map and note the dma addrs and lengths")
Interacting with commit
cdd30ebb1b ("module: Convert symbol namespace to string literal")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://patchwork.freedesktop.org/patch/msgid/20241209121717.2abe8026@canb.auug.org.au
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
There different types of BOs are supported:
- shmem
A user application uses shmem BOs as input/output for its workload running
on NPU.
- device memory heap
The fixed size buffer dedicated to the device.
- device buffer
The buffer object allocated from device memory heap.
- command buffer
The buffer object created for delivering commands. The command buffer
object is small and pinned on creation.
New IOCTLs are added: CREATE_BO, GET_BO_INFO, SYNC_BO. SYNC_BO is used
to explicitly flush CPU cache for BO memory.
Co-developed-by: Min Ma <min.ma@amd.com>
Signed-off-by: Min Ma <min.ma@amd.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241118172942.2014541-7-lizhi.hou@amd.com
The hardware can be shared among multiple user applications. The
hardware resources are allocated/freed based on the request from
user application via driver IOCTLs.
DRM_IOCTL_AMDXDNA_CREATE_HWCTX
Allocate tile columns and create a hardware context structure to track the
usage and status of the resources. A hardware context ID is returned for
XDNA command execution.
DRM_IOCTL_AMDXDNA_DESTROY_HWCTX
Release hardware context based on its ID. The tile columns belong to
this hardware context will be reclaimed.
DRM_IOCTL_AMDXDNA_CONFIG_HWCTX
Config hardware context. Bind the hardware context to the required
resources.
Co-developed-by: Min Ma <min.ma@amd.com>
Signed-off-by: Min Ma <min.ma@amd.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241118172942.2014541-6-lizhi.hou@amd.com
The hardware mailboxes are used by the driver to submit requests to
firmware and receive the completion notices from hardware.
Initially, a management mailbox channel is up and running. The driver may
request firmware to create/destroy more channels dynamically through
management channel.
Add driver internal mailbox interfaces.
- create/destroy a mailbox channel instance
- send a message to the firmware through a specific channel
- wait for a notification from the specific channel
Co-developed-by: George Yang <George.Yang@amd.com>
Signed-off-by: George Yang <George.Yang@amd.com>
Co-developed-by: Min Ma <min.ma@amd.com>
Signed-off-by: Min Ma <min.ma@amd.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241118172942.2014541-4-lizhi.hou@amd.com