linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-10 11:40:19 -04:00

Author	SHA1	Message	Date
Maarten Lankhorst	ea54d49285	drm/xe/display: Remove dereferences of ggtt for tile id Obtain the id from the root tile. Likely this can be hardcoded to 0, but use the clean solution of obtaining root id and doing that. to_xe_device(ggtt->tile) can also be easily replaced with xe. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250505121924.921544-5-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-09 10:21:45 +02:00
Maarten Lankhorst	e0ee402750	drm/xe: Add xe_ggtt_alloc Instead of allocating inside xe_tile, create a new function that returns an allocated struct xe_ggtt from xe_ggtt.c Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250505121924.921544-4-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-09 10:21:44 +02:00
Maarten Lankhorst	b5fe33dcb8	drm/xe: Add xe_ggtt_might_lock Another requirement of hiding more of struct xe_ggtt. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250505121924.921544-3-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-09 10:21:44 +02:00
Maarten Lankhorst	3975d35683	drm/xe: Use xe_ggtt_map_bo_unlocked for resume This is the first step to hide the details of struct xe_ggtt. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250505121924.921544-2-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-09 10:21:44 +02:00
Maarten Lankhorst	9a2d22a46b	drm/xe: Remove IOSF_MBI select. IOSF_MBI was only useful for some gen8 platforms, which were never supported by Xe. Presumably needed for display at one point, but display is fixed to put stubs in compat-i915-headers/vlv_sideband.h. (in drm-intel-next: vlv_iosf_sb.h) Link: https://lore.kernel.org/r/20250605074644.71036-1-dev@lankhorst.se Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-09 10:16:17 +02:00
Michal Wajdeczko	bfb114751a	drm/xe/topology: Stop trying to fix programming mistakes We shouldn't ever pass more DSS registers than our hardcoded limit, it should be sufficient to just assert that instead of trying to fix it, as this will never happen in the production driver. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250604202908.769-4-michal.wajdeczko@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-06-06 11:47:05 -07:00
Michal Wajdeczko	113c3f5968	drm/xe/topology: Use register array size instead magic number Since we keep registers in the array we can simply count them and stop relying on a magic number when checking whether we made a mistake. Also we can switch to use xe_gt_assert() since it could be just our programming mistake during platform bringup, no need to keep drm_WARN() in the production driver. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250604202908.769-3-michal.wajdeczko@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-06-06 11:47:05 -07:00
Michal Wajdeczko	f574855a4c	drm/xe/topology: Simplify code for loading DSS mask Instead of passing registers using va_list we can keep them in the static array and pass as such and also lower driver footprint: add/remove: 2/0 grow/shrink: 0/2 up/down: 24/-175 (-151) Function old new delta geometry_regs - 12 +12 compute_regs - 12 +12 xe_gt_topology_init 550 527 -23 load_dss_mask 449 297 -152 Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250604202908.769-2-michal.wajdeczko@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-06-06 11:47:05 -07:00
Dave Airlie	678bb4d625	drm/xe: don't store the xe device pointer inside xe_ttm_tt This device pointer is nearly always available without storing an extra copy for each tt in the system. Just noticed this while reading over the xe shrinker code. Signed-off-by: Dave Airlie <airlied@redhat.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250605062103.1234620-1-airlied@gmail.com	2025-06-06 09:30:56 -07:00
Daniele Ceraolo Spurio	ade1473914	drm/xe/vf: Store the GuC FW info in guc->fw The GuC compatibility version that we read from the CSS header in native/PF and the GuC VF version that we get when a VF handshakes with the GuC are the same version number, so we should store it into the same structure. This makes all the checks based on the compatibility version automatically work for VFs without having to copy the value over. For completion, also copy the wanted version and set the path to a known string to indicate that the FW is PF-loaded. This way all the info will be coherent when dumped from debugfs. v2: several code cleanups and style changes (Michal), rebase on bootstrap changes. v3: s/min/wanted/, clarify that handshake must happen before we can get the VF versions (Michal) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250603235432.720833-10-daniele.ceraolospurio@intel.com	2025-06-06 08:37:26 -07:00
Daniele Ceraolo Spurio	3ef462f003	drm/xe/vf: Use uc_fw_version to store the negotiated GuC ABI Instead of using a VF-specific type, we can use the common uc_fw_version structure. This also means that we can use the available macros to compare ABI versions. While at it, exit early from the bootstrap if this is not the first time we're doing it and the version hasn't changed, so we don't end up logging it multiple times. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250603235432.720833-9-daniele.ceraolospurio@intel.com	2025-06-06 08:33:19 -07:00
Daniele Ceraolo Spurio	90f4d3f756	drm/xe/vf: Boostrap all GTs immediately after MMIO init Currently we perform the bootstrap for the primary GT early on during device init, while the media GT bootstrap happens when we try and fetch the hwconfig table. For consistency, move the bootstrap of the media GT to happen at the same time as the primary GT, so that all the subsequent code can rely on both GTs being in the same state. v2: Also drop config query from min_guc_load since we now do it early (Michal) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250603235432.720833-8-daniele.ceraolospurio@intel.com	2025-06-06 08:33:18 -07:00
Daniele Ceraolo Spurio	10a2bc5807	drm/xe/uc: Prepare uc_fw_version for storing the VF ABI version The VF ABI version has a branch field, so to store it inside the uc_fw_version we need to add a new branch variable to the latter. Existing code needs to be updated to handle the fact that we have the new field. v2: split out to its own patch (Michal) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250603235432.720833-7-daniele.ceraolospurio@intel.com	2025-06-06 08:33:16 -07:00
Nitin Gote	8d8431298f	drm/xe/xe3: Disable null query for anyhit shader Set DIS_NULL_QUERY bit of RT_CTRL register to disable null query for anyhit shader for Xe3 IP. Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Nitin Gote <nitin.r.gote@intel.com> Link: https://lore.kernel.org/r/20250605100812.2547808-1-nitin.r.gote@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-06-05 15:57:37 -07:00
Matthew Auld	9d9fca62dc	drm/xe/guc_submit: add back fix Daniele noticed that the fix in commit `2d2be279f1` ("drm/xe: fix UAF around queue destruction") looks to have been unintentionally removed as part of handling a conflict in some past merge commit. Add it back. Fixes: `ac44ff7cec` ("Merge tag 'drm-xe-fixes-2024-10-10' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes") Reported-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.12+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250603174213.1543579-2-matthew.auld@intel.com	2025-06-05 09:15:25 +01:00
Matthew Brost	99e8050898	drm/xe: Make VMA tile_present, tile_invalidated access rules clear Document VMA tile_invalidated access rules, use READ_ONCE / WRITE_ONCE for opportunistic checks of tile_present and tile_invalidated, move tile_invalidated state change from page fault handler to PT code under the correct locks, and add lockdep asserts to TLB invalidation paths. v2: - Assert VM dma-resv lock rather than BO in zap PTEs v3: - Back to BO's dma-resv lock, adjust documentation v4: - Add WRITE_ONCE in xe_vm_invalidate_vma (Thomas) - Change lockdep assert for userptr in xe_vm_invalidate_vma (CI) - Take userptr notifier lock in read mode in xe_vm_userptr_pin before calling xe_vm_invalidate_vma (CI) v5: - Fix typos (Thomas) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250602164412.1912293-1-matthew.brost@intel.com	2025-06-04 07:38:53 -07:00
Maarten Lankhorst	531bef26d1	drm/xe/svm: Fix regression disallowing 64K SVM migration When changing the condition from >= SZ_64K, it was changed to <= SZ_64K. This disallows migration of 64K, which is the exact minimum allowed. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5057 Fixes: `a9ac0fa455` ("drm/xe: Strict migration policy for atomic SVM faults") Cc: stable@vger.kernel.org Cc: Matthew Brost <matthew.brost@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Link: https://lore.kernel.org/r/20250521090102.2965100-1-dev@lankhorst.se	2025-06-04 12:01:04 +02:00
Michal Wajdeczko	c12c729e97	drm/xe/vf: Add sanity check for GGTT configuration The VF GGTT configuration was prepared by the PF, which should be trusted, was obtained from the GuC, which likely already did some sanity checks too, but since it's a received data, we should have our own sanity checks to detect early any misconfiguration. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Link: https://lore.kernel.org/r/20250602103325.549-4-michal.wajdeczko@intel.com	2025-06-03 12:36:00 +02:00
Michal Wajdeczko	eb9b34734c	drm/xe/vf: Move tile-related VF functions to separate file Some of our VF functions, even if they take a GT pointer, work only on primary GT and really are tile-related and would be better to keep them separate from the rest of true GT-oriented functions. Move them to a file and update to take a tile pointer instead. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com> Link: https://lore.kernel.org/r/20250602103325.549-3-michal.wajdeczko@intel.com	2025-06-03 12:35:57 +02:00
Michal Wajdeczko	ce2ae1b97a	drm/xe/vf: Introduce helpers to access GGTT configuration In an upcoming patch we want to separate tile-oriented VF functions from GT-oriented functions, and to allow the former to access a GGTT configuration stored at GT level we need to provide some helpers. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Tomasz Lis <tomasz.lis@intel.com> Link: https://lore.kernel.org/r/20250602103325.549-2-michal.wajdeczko@intel.com	2025-06-03 12:35:55 +02:00
Maciej Patelczyk	28b996ce73	drm/xe: remove unmatched xe_vm_unlock() from __xe_exec_queue_init() There is unmatched xe_vm_unlock() in the __xe_exec_queue_init(). Leftover from commit `fbeaad071a` ("drm/xe: Create LRC BO without VM") Fixes: `fbeaad071a` ("drm/xe: Create LRC BO without VM") Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Link: https://lore.kernel.org/r/20250530135627.2821612-1-maciej.patelczyk@intel.com	2025-06-02 14:04:21 -07:00
Lucas De Marchi	399c5f5409	drm/xe/configfs: Add internal API to documentation Add the internal configfs API like is done with other parts of the driver. Also ensure the correct number of '=' chars are used for the header. Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250528-engine-mask-v4-5-f4636d2a890a@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-02 11:00:46 -07:00
Lucas De Marchi	d09bc3edfe	drm/xe/configfs: Add attribute to disable engines Add the userspace interface to load the driver with fewer engines. The syntax is to just echo the engine names to a file in configfs, like below: echo 'rcs0,bcs0' > /sys/kernel/config/xe/<bdf>/engine_allowed With that engines other than rcs0 and bcs0 will not be enabled. To enable all instances from a class, a '*' can be used. Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250528-engine-mask-v4-4-f4636d2a890a@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-02 11:00:46 -07:00
Lucas De Marchi	58b51df807	drm/xe: Allow to disable engines Sometimes it's useful to load the driver with a smaller set of engines to allow more targeted debugging, particularly on early enabling. Besides checking what is fused off in hardware, add similar logic to disable engines in software. This will use configfs to allow users to set what engine to disable, so already prepare for that. The exact configfs interface will be added later. Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250528-engine-mask-v4-3-f4636d2a890a@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-02 11:00:46 -07:00
Lucas De Marchi	2b0ef1f7a2	drm/xe: Convert "fused off" messages to be gt-based It's useful to see in the log message what GT was being checked for disabled/fused-off engines. Especially on multi-tile platforms the different tiles may be fused differently making it harder to parse the information. Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250528-engine-mask-v4-2-f4636d2a890a@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-02 11:00:46 -07:00
Lucas De Marchi	d8636cce7a	drm/xe/configfs: Drop trailing semicolons Drop the semicolons from the dummy implementation: they shouldn't be there. Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250528-engine-mask-v4-1-f4636d2a890a@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-02 11:00:46 -07:00
Michal Wajdeczko	d65650a9d1	drm/xe/guc: Resend potentially lost H2G MMIO request There could be a scenario where the VF driver is resuming faster than the PF driver is able to complete the VF FLR sequence, which includes a reset of the VF scratch registers. This may result in deletion of the ongoing HXG message (it could be either a host request or a GuC response). When we detect that the HXG message was likely lost (scratch register with HXG header was zeroed), try to send this request once more before giving up. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Link: https://lore.kernel.org/r/20250528090021.329-1-michal.wajdeczko@intel.com	2025-06-02 19:22:03 +02:00
Michal Wajdeczko	eae7844758	drm/xe: Use GT-oriented printer to dump topology on init During the probe we dump the discovered GT topology, but instead of a generic printer we can use our own GT-oriented printer which contains information about the source GT. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250530210524.505-1-michal.wajdeczko@intel.com	2025-06-02 19:19:57 +02:00
Michal Wajdeczko	8d425229ac	drm/xe: Convert page fault messages to be GT-oriented We are processing here G2H messages, so we should use GT oriented messages to retain information about the origin GT. While at it, print error codes in a user-friendly way. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250530164835.461-1-michal.wajdeczko@intel.com	2025-06-02 19:12:54 +02:00
Lucas De Marchi	dea7a2b4f6	drm/xe/hwmon: Simplify and fix 32b wrap Like done in commit `eaa287069a` ("drm/xe/guc_submit: Simplify and fix diff calculation"), just use u32 for wrapping the value, which is simpler and more correct: when wrapping on 32b, the accumulated value was off by one. Also, do not mix the u64 value from pmt with the u32 value used for the calculation. Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250530-xe-hwmon-wrap-v2-1-ce653db7fe4a@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-02 09:34:08 -07:00
Daniele Ceraolo Spurio	ccd3c6820a	drm/xe/pxp: Decouple queue addition from PXP start Starting PXP and adding a queue to the PXP queue list are separate actions. Given that a queue can only be added to the list if PXP is active, the 2 actions were bundled together to avoid having to re-lock and re-check the status to perform the queue addition after having done so during the PXP start. However, we don't save a lot of complexity by doing so and we lose in clarity of code, so overall it's cleaner to just keep the 2 actions separate. v2: remove leftover rpm_get (John), fix rpm_put in error case Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250522225401.3953243-8-daniele.ceraolospurio@intel.com	2025-06-02 08:28:49 -07:00
Daniele Ceraolo Spurio	21784ca960	drm/xe/pxp: Clarify PXP queue creation behavior if PXP is not ready The expected flow of operations when using PXP is to query the PXP status and wait for it to transition to "ready" before attempting to create an exec_queue. This flow is followed by the Mesa driver, but there is no guarantee that an incorrectly coded (or malicious) app will not attempt to create the queue first without querying the status. Therefore, we need to clarify what the expected behavior of the queue creation ioctl is in this scenario. Currently, the ioctl always fails with an -EBUSY code no matter the error, but for consistency it is better to distinguish between "failed to init" (-EIO) and "not ready" (-EBUSY), the same way the query ioctl does. Note that, while this is a change in the return code of an ioctl, the behavior of the ioctl in this particular corner case was not clearly spec'd, so no one should have been relying on it (and we know that Mesa, which is the only known userspace for this, didn't). v2: Minor rework of the doc (Rodrigo) Fixes: `72d479601d` ("drm/xe/pxp/uapi: Add userspace and LRC support for PXP-using queues") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250522225401.3953243-7-daniele.ceraolospurio@intel.com	2025-06-02 08:28:48 -07:00
Daniele Ceraolo Spurio	1d891ee820	drm/xe/pxp: Use the correct define in the set_property_funcs array The define of the extension type was accidentally used instead of the one of the property itself. They're both zero, so no functional issue, but we should use the correct define for code correctness. Fixes: `41a97c4a12` ("drm/xe/pxp/uapi: Add API to mark a BO as using PXP") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250522225401.3953243-6-daniele.ceraolospurio@intel.com	2025-06-02 08:28:47 -07:00
Matthew Auld	38fafa9f39	drm/xe/sched: stop re-submitting signalled jobs Customer is reporting a really subtle issue where we get random DMAR faults, hangs and other nasties for kernel migration jobs when stressing stuff like s2idle/s3/s4. The explosions seems to happen somewhere after resuming the system with splats looking something like: PM: suspend exit rfkill: input handler disabled xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0 xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=24496, lrc_seqno=24496, guc_id=0, flags=0x13 in no process [-1] xe 0000:00:02.0: [drm] GT0: Kernel-submitted job timed out The likely cause appears to be a race between suspend cancelling the worker that processes the free_job()'s, such that we still have pending jobs to be freed after the cancel. Following from this, on resume the pending_list will now contain at least one already complete job, but it looks like we call drm_sched_resubmit_jobs(), which will then call run_job() on everything still on the pending_list. But if the job was already complete, then all the resources tied to the job, like the bb itself, any memory that is being accessed, the iommu mappings etc. might be long gone since those are usually tied to the fence signalling. This scenario can be seen in ftrace when running a slightly modified xe_pm IGT (kernel was only modified to inject artificial latency into free_job to make the race easier to hit): xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ... 
xe_exec_queue_stop: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13 xe_exec_queue_stop: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=1, guc_state=0x0, flags=0x4 xe_exec_queue_stop: dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=0, guc_state=0x0, flags=0x3 xe_exec_queue_stop: dev=0000:00:02.0, 1:0x1, gt=1, width=1, guc_id=1, guc_state=0x0, flags=0x3 xe_exec_queue_stop: dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=2, guc_state=0x0, flags=0x3 xe_exec_queue_resubmit: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13 xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ... ..... xe_exec_queue_memory_cat_error: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x3, flags=0x13 So the job_run() is clearly triggered twice for the same job, even though the first must have already signalled to completion during suspend. We can also see a CAT error after the re-submit. To prevent this only resubmit jobs on the pending_list that have not yet signalled. v2: - Make sure to re-arm the fence callbacks with sched_start(). v3 (Matt B): - Stop using drm_sched_resubmit_jobs(), which appears to be deprecated and just open-code a simple loop such that we skip calling run_job() on anything already signalled. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4856 Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: William Tseng <william.tseng@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://lore.kernel.org/r/20250528113328.289392-2-matthew.auld@intel.com	2025-06-02 12:03:46 +01:00
Thomas Hellström	9d5558649f	drm/xe: Rework eviction rejection of bound external bos For preempt_fence mode VM's we're rejecting eviction of shared bos during VM_BIND. However, since we do this in the move() callback, we're getting an eviction failure warning from TTM. The TTM callback intended for these things is eviction_valuable(). However, the latter doesn't pass in the struct ttm_operation_ctx needed to determine whether the caller needs this. Instead, attach the needed information to the vm under the vm->resv, until we've been able to update TTM to provide the needed information. And add sufficient lockdep checks to prevent misuse and races. v2: - Fix a copy-paste error in xe_vm_clear_validating() v3: - Fix kerneldoc errors. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: `0af944f0e3` ("drm/xe: Reject BO eviction if BO is bound to current VM") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250528164105.234718-1-thomas.hellstrom@linux.intel.com	2025-06-02 11:49:05 +02:00
Arnd Bergmann	e4931f8be3	drm/xe/vsec: fix CONFIG_INTEL_VSEC dependency The XE driver can be built with or without VSEC support, but fails to link as built-in if vsec is in a loadable module: x86_64-linux-ld: vmlinux.o: in function `xe_vsec_init': (.text+0x1e83e16): undefined reference to `intel_vsec_register' The normal fix for this is to add a 'depends on INTEL_VSEC || !INTEL_VSEC', forcing XE to be a loadable module as well, but that causes a circular dependency: symbol DRM_XE depends on INTEL_VSEC symbol INTEL_VSEC depends on X86_PLATFORM_DEVICES symbol X86_PLATFORM_DEVICES is selected by DRM_XE The problem here is selecting a symbol from another subsystem, so change that as well and rephrase the 'select' into the corresponding dependency. Since X86_PLATFORM_DEVICES is 'default y', there is no change to defconfig builds here. Fixes: `0c45e76fcc` ("drm/xe/vsec: Support BMG devices") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250529172355.2395634-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-05-30 14:25:00 -07:00
Raag Jadav	61761a6b57	drm/xe: drop redundant conversion to bool The result of integer comparison already evaluates to bool. No need for explicit conversion. No functional impact. Fixes: `0e414bf7ad` ("drm/xe: Expose PCIe link downgrade attributes") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505292205.MoljmkjQ-lkp@intel.com/ Signed-off-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250529160937.490147-1-raag.jadav@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:31:56 -04:00
Karthik Poosa	48a1126836	drm/xe/hwmon: Expose power sysfs entries based on firmware support Enable hwmon sysfs entries (power_xxx) only when GPU firmware supports it. Previously, these entries were created if the MMIO register was present. Now, we enable based on the data in the register. v2: Remove a unnecessary comment. (Rodrigo) Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-7-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	0c5405d3aa	drm/xe/hwmon: Read energy status from PMT Read card and package energy status using pmt apis instead of xe_mmio for supported platforms. Enable Battlemage to read energy from PMT. v2: - Remove unused has_pmt_energy field. (Badal) - Use GENMASK to extract energy data. (Badal) v3: - Move PMT energy register offset and GENMASK to xe_pmt.h - Address review comments. (Jani) v4: - Remove unnecessary debug print. (Badal) v5: - Resolve an unused variable warning. - Add a return value check. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-6-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	719d8a5959	drm/xe/hwmon: Expose powerX_cap_interval Expose powerX_cap_interval to manage burst power limit time window. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-5-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	c713b9a23c	drm/xe/hwmon: Add support to manage PL2 though mailbox Add support to manage power limit PL2 (burst limit) through pcode mailbox commands. v2: - Update power1_cap definition in hwmon documentation. (Badal) - Clamp PL2 power limit to GPU firmware default value. v3: - Activate the power label when either the PL1 or PL2 power limit is enabled. v4: - Update description of pl2_on_boot variable to fix kernel-doc error. v5: - Remove unnecessary drm_warn. - Rectify powerX_label permission to read-only on platforms without mailbox power limits support. - Expose powerX_cap entries only on platforms with mailbox support. v6: - Improve commit message, refer to BIOS as GPU firmware. - Refer to card firmware as GPU firmware in code. - Remove unnecessary drm_dbg. - Print supported and unsupported power limits. (Rodrigo) - Enable powerN_cap/max_xxx entries only when power limits supported in GPU firmware. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-4-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	25e963a09e	drm/xe/hwmon: Move card reactive critical power under channel card Move power2/curr2_crit to channel 1 i.e power1/curr1_crit as this represents the entire card critical power/current. v2: Update the date of curr1_crit also in hwmon documentation. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: `345dadc4f6` ("drm/xe/hwmon: Add infra to support card power and energy attributes") Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-3-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	7596d839f6	drm/xe/hwmon: Add support to manage power limits through mailbox Add support to manage power limits using pcode mailbox commands for supported platforms. v2: - Address review comments. (Badal) - Use mailbox commands instead of registers to manage power limits for BMG. - Clamp the maximum power limit to GPU firmware default value. v3: - Clamp power limit in write also for platforms with mailbox support. v4: - Remove unnecessary debug prints. (Badal) v5: - Update description of variable pl1_on_boot to fix kernel-doc error. v6: - Improve commit message, refer to BIOS as GPU firmware. - Change macro READ_PL_FROM_BIOS to READ_PL_FROM_FW. - Rectify drm_warn to drm_info. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: `e90f7a58e6` ("drm/xe/hwmon: Add HWMON support for BMG") Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-2-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Matthew Brost	1a524e8b48	drm/xe: Do not warn on SVM migration failing because of 64k requirements On platforms which only support 64k VRAM pages, it is expected that 4k faults will not migrate. Do not warn on this, rather print a debug message. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250529164338.1745515-1-matthew.brost@intel.com	2025-05-29 21:52:15 -07:00
Balasubramani Vivekanandan	241cc827c0	drm/xe/mocs: Initialize MOCS index early MOCS uc_index is used even before it is initialized in the following callstack guc_prepare_xfer() __xe_guc_upload() xe_guc_min_load_for_hwconfig() xe_uc_init_hwconfig() xe_gt_init_hwconfig() Do MOCS index initialization earlier in the device probe. Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Ravi Kumar Vodapalli <ravi.kumar.vodapalli@intel.com> Link: https://lore.kernel.org/r/20250520142445.2792824-1-balasubramani.vivekanandan@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-05-29 14:29:18 -07:00
Niranjana Vishwanathapura	fbeaad071a	drm/xe: Create LRC BO without VM Specifying VM during lrc->bo creation requires VM's reference to be held for the lifetime of lrc->bo as it will use VM's dma reservation object. Using VM's dma reservation object for lrc->bo doesn't provide any advantage. Hence do not pass VM while creating lrc->bo. v2: Use xe_bo_unpin_map_no_vm (Matthew Brost) Fixes: `264eecdba2` ("drm/xe: Decouple xe_exec_queue and xe_lrc") Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250529052031.2429120-2-niranjana.vishwanathapura@intel.com	2025-05-29 09:18:31 -07:00
Matthew Auld	4f296d77cf	drm/xe/vm: move xe_svm_init() earlier In xe_vm_close_and_put() we need to be able to call xe_svm_fini(), however during vm creation we can call this on the error path, before having actually initialised the svm state, leading to various splats followed by a fatal NPD. Fixes: `6fd979c2f3` ("drm/xe: Add SVM init / close / fini to faulting VMs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4967 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250514152424.149591-4-matthew.auld@intel.com	2025-05-29 11:56:03 +01:00
Matthew Auld	96af397aa1	drm/xe/vm: move rebind_work init earlier In xe_vm_close_and_put() we need to be able to call flush_work(rebind_work), however during vm creation we can call this on the error path, before having actually set up the worker, leading to a splat from flush_work(). It looks like we can simply move the worker init step earlier to fix this. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250514152424.149591-3-matthew.auld@intel.com	2025-05-29 11:56:01 +01:00
Matthew Auld	338ec84dee	drm/xe/bo: optimise CCS case for WB pages Dealing with CCS state is significant on LNL+, where we end up clearing the compression state on every page alloc using the blitter for user buffers, including also saving and restoring it when moving between domains, plus we need to alloc extra pages to hold the raw CCS state for the save step. However all compression PAT modes, on platforms like LNL, also require coh_none, meaning that only WC memory can use compression in the first place. With this we can be sneaky and completely ignore CCS for WB buffers, which is likely the common case anyway. This would then skip all blitter moves/clears between sys <-> tt and then also means we can drop the extra CCS pages. This should be safe since there is no way to interact with the compression state (potentially uncleared) without using a PAT enabled index (which is rejected at bind), including if trying to be malicious and copy the raw CCS state from userspace, which should give back all zeroes if the src surface (indirect) is lacking compressed PAT index. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Link: https://lore.kernel.org/r/20250516153810.223530-2-matthew.auld@intel.com	2025-05-29 11:54:45 +01:00
Michal Wajdeczko	2cb38bb0ad	drm/xe: Allow to trigger GT resets using debugfs writes Today we allow to trigger GT resets by reading dedicated debugfs files "force_reset" and "force_reset_sync" that we are exposing using drm_info_list[] and drm_debugfs_create_files(). To avoid triggering potentially disruptive actions during otherwise "safe" read operations, expose those two attributes using debugfs function where we can specify file permissions and provide custom "write" handler to trigger the GT resets also from there. This step would allow us to drop triggering GT resets during read operations, which we leave just to give users more time to switch. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250519200914.216-1-michal.wajdeczko@intel.com	2025-05-28 20:13:18 +02:00

1352089 Commits