linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-10 05:39:42 -04:00

Author	SHA1	Message	Date
Matthew Auld	38fafa9f39	drm/xe/sched: stop re-submitting signalled jobs Customer is reporting a really subtle issue where we get random DMAR faults, hangs and other nasties for kernel migration jobs when stressing stuff like s2idle/s3/s4. The explosions seems to happen somewhere after resuming the system with splats looking something like: PM: suspend exit rfkill: input handler disabled xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0 xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=24496, lrc_seqno=24496, guc_id=0, flags=0x13 in no process [-1] xe 0000:00:02.0: [drm] GT0: Kernel-submitted job timed out The likely cause appears to be a race between suspend cancelling the worker that processes the free_job()'s, such that we still have pending jobs to be freed after the cancel. Following from this, on resume the pending_list will now contain at least one already complete job, but it looks like we call drm_sched_resubmit_jobs(), which will then call run_job() on everything still on the pending_list. But if the job was already complete, then all the resources tied to the job, like the bb itself, any memory that is being accessed, the iommu mappings etc. might be long gone since those are usually tied to the fence signalling. This scenario can be seen in ftrace when running a slightly modified xe_pm IGT (kernel was only modified to inject artificial latency into free_job to make the race easier to hit): xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ... 
xe_exec_queue_stop: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13 xe_exec_queue_stop: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=1, guc_state=0x0, flags=0x4 xe_exec_queue_stop: dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=0, guc_state=0x0, flags=0x3 xe_exec_queue_stop: dev=0000:00:02.0, 1:0x1, gt=1, width=1, guc_id=1, guc_state=0x0, flags=0x3 xe_exec_queue_stop: dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=2, guc_state=0x0, flags=0x3 xe_exec_queue_resubmit: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13 xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ... ..... xe_exec_queue_memory_cat_error: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x3, flags=0x13 So the job_run() is clearly triggered twice for the same job, even though the first must have already signalled to completion during suspend. We can also see a CAT error after the re-submit. To prevent this only resubmit jobs on the pending_list that have not yet signalled. v2: - Make sure to re-arm the fence callbacks with sched_start(). v3 (Matt B): - Stop using drm_sched_resubmit_jobs(), which appears to be deprecated and just open-code a simple loop such that we skip calling run_job() on anything already signalled. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4856 Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: William Tseng <william.tseng@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://lore.kernel.org/r/20250528113328.289392-2-matthew.auld@intel.com	2025-06-02 12:03:46 +01:00
Thomas Hellström	9d5558649f	drm/xe: Rework eviction rejection of bound external bos For preempt_fence mode VM's we're rejecting eviction of shared bos during VM_BIND. However, since we do this in the move() callback, we're getting an eviction failure warning from TTM. The TTM callback intended for these things is eviction_valuable(). However, the latter doesn't pass in the struct ttm_operation_ctx needed to determine whether the caller needs this. Instead, attach the needed information to the vm under the vm->resv, until we've been able to update TTM to provide the needed information. And add sufficient lockdep checks to prevent misuse and races. v2: - Fix a copy-paste error in xe_vm_clear_validating() v3: - Fix kerneldoc errors. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: `0af944f0e3` ("drm/xe: Reject BO eviction if BO is bound to current VM") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250528164105.234718-1-thomas.hellstrom@linux.intel.com	2025-06-02 11:49:05 +02:00
Arnd Bergmann	e4931f8be3	drm/xe/vsec: fix CONFIG_INTEL_VSEC dependency The XE driver can be built with or without VSEC support, but fails to link as built-in if vsec is in a loadable module: x86_64-linux-ld: vmlinux.o: in function `xe_vsec_init': (.text+0x1e83e16): undefined reference to `intel_vsec_register' The normal fix for this is to add a 'depends on INTEL_VSEC || !INTEL_VSEC', forcing XE to be a loadable module as well, but that causes a circular dependency: symbol DRM_XE depends on INTEL_VSEC symbol INTEL_VSEC depends on X86_PLATFORM_DEVICES symbol X86_PLATFORM_DEVICES is selected by DRM_XE The problem here is selecting a symbol from another subsystem, so change that as well and rephrase the 'select' into the corresponding dependency. Since X86_PLATFORM_DEVICES is 'default y', there is no change to defconfig builds here. Fixes: `0c45e76fcc` ("drm/xe/vsec: Support BMG devices") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250529172355.2395634-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-05-30 14:25:00 -07:00
Raag Jadav	61761a6b57	drm/xe: drop redundant conversion to bool The result of integer comparison already evaluates to bool. No need for explicit conversion. No functional impact. Fixes: `0e414bf7ad` ("drm/xe: Expose PCIe link downgrade attributes") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505292205.MoljmkjQ-lkp@intel.com/ Signed-off-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250529160937.490147-1-raag.jadav@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:31:56 -04:00
Karthik Poosa	48a1126836	drm/xe/hwmon: Expose power sysfs entries based on firmware support Enable hwmon sysfs entries (power_xxx) only when GPU firmware supports it. Previously, these entries were created if the MMIO register was present. Now, we enable based on the data in the register. v2: Remove an unnecessary comment. (Rodrigo) Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-7-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	0c5405d3aa	drm/xe/hwmon: Read energy status from PMT Read card and package energy status using pmt apis instead of xe_mmio for supported platforms. Enable Battlemage to read energy from PMT. v2: - Remove unused has_pmt_energy field. (Badal) - Use GENMASK to extract energy data. (Badal) v3: - Move PMT energy register offset and GENMASK to xe_pmt.h - Address review comments. (Jani) v4: - Remove unnecessary debug print. (Badal) v5: - Resolve an unused variable warning. - Add a return value check. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-6-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	719d8a5959	drm/xe/hwmon: Expose powerX_cap_interval Expose powerX_cap_interval to manage burst power limit time window. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-5-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	c713b9a23c	drm/xe/hwmon: Add support to manage PL2 though mailbox Add support to manage power limit PL2 (burst limit) through pcode mailbox commands. v2: - Update power1_cap definition in hwmon documentation. (Badal) - Clamp PL2 power limit to GPU firmware default value. v3: - Activate the power label when either the PL1 or PL2 power limit is enabled. v4: - Update description of pl2_on_boot variable to fix kernel-doc error. v5: - Remove unnecessary drm_warn. - Rectify powerX_label permission to read-only on platforms without mailbox power limits support. - Expose powerX_cap entries only on platforms with mailbox support. v6: - Improve commit message, refer to BIOS as GPU firmware. - Refer to card firmware as GPU firmware in code. - Remove unnecessary drm_dbg. - Print supported and unsupported power limits. (Rodrigo) - Enable powerN_cap/max_xxx entries only when power limits supported in GPU firmware. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-4-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	25e963a09e	drm/xe/hwmon: Move card reactive critical power under channel card Move power2/curr2_crit to channel 1 i.e power1/curr1_crit as this represents the entire card critical power/current. v2: Update the date of curr1_crit also in hwmon documentation. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: `345dadc4f6` ("drm/xe/hwmon: Add infra to support card power and energy attributes") Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-3-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Karthik Poosa	7596d839f6	drm/xe/hwmon: Add support to manage power limits though mailbox Add support to manage power limits using pcode mailbox commands for supported platforms. v2: - Address review comments. (Badal) - Use mailbox commands instead of registers to manage power limits for BMG. - Clamp the maximum power limit to GPU firmware default value. v3: - Clamp power limit in write also for platforms with mailbox support. v4: - Remove unnecessary debug prints. (Badal) v5: - Update description of variable pl1_on_boot to fix kernel-doc error. v6: - Improve commit message, refer to BIOS as GPU firmware. - Change macro READ_PL_FROM_BIOS to READ_PL_FROM_FW. - Rectify drm_warn to drm_info. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: `e90f7a58e6` ("drm/xe/hwmon: Add HWMON support for BMG") Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250529163458.2354509-2-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-30 11:30:01 -04:00
Matthew Brost	1a524e8b48	drm/xe: Do not warn on SVM migration failing because of 64k requirements On platforms which only support 64k VRAM pages, it is expected that 4k faults will not migrate. Do not warn on this, rather print a debug message. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250529164338.1745515-1-matthew.brost@intel.com	2025-05-29 21:52:15 -07:00
Balasubramani Vivekanandan	241cc827c0	drm/xe/mocs: Initialize MOCS index early MOCS uc_index is used even before it is initialized in the following callstack guc_prepare_xfer() __xe_guc_upload() xe_guc_min_load_for_hwconfig() xe_uc_init_hwconfig() xe_gt_init_hwconfig() Do MOCS index initialization earlier in the device probe. Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Ravi Kumar Vodapalli <ravi.kumar.vodapalli@intel.com> Link: https://lore.kernel.org/r/20250520142445.2792824-1-balasubramani.vivekanandan@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-05-29 14:29:18 -07:00
Niranjana Vishwanathapura	fbeaad071a	drm/xe: Create LRC BO without VM Specifying VM during lrc->bo creation requires VM's reference to be held for the lifetime of lrc->bo as it will use VM's dma reservation object. Using VM's dma reservation object for lrc->bo doesn't provide any advantage. Hence do not pass VM while creating lrc->bo. v2: Use xe_bo_unpin_map_no_vm (Matthew Brost) Fixes: `264eecdba2` ("drm/xe: Decouple xe_exec_queue and xe_lrc") Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250529052031.2429120-2-niranjana.vishwanathapura@intel.com	2025-05-29 09:18:31 -07:00
Matthew Auld	4f296d77cf	drm/xe/vm: move xe_svm_init() earlier In xe_vm_close_and_put() we need to be able to call xe_svm_fini(), however during vm creation we can call this on the error path, before having actually initialised the svm state, leading to various splats followed by a fatal NPD. Fixes: `6fd979c2f3` ("drm/xe: Add SVM init / close / fini to faulting VMs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4967 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250514152424.149591-4-matthew.auld@intel.com	2025-05-29 11:56:03 +01:00
Matthew Auld	96af397aa1	drm/xe/vm: move rebind_work init earlier In xe_vm_close_and_put() we need to be able to call flush_work(rebind_work), however during vm creation we can call this on the error path, before having actually set up the worker, leading to a splat from flush_work(). It looks like we can simply move the worker init step earlier to fix this. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250514152424.149591-3-matthew.auld@intel.com	2025-05-29 11:56:01 +01:00
Matthew Auld	338ec84dee	drm/xe/bo: optimise CCS case for WB pages Dealing with CCS state is significant on LNL+, where we end up clearing the compression state on every page alloc using the blitter for user buffers, including also saving and restoring it when moving between domains, plus we need to alloc extra pages to hold the raw CCS state for the save step. However all compression PAT modes, on platforms like LNL, also require coh_none, meaning that only WC memory can use compression in the first place. With this we can be sneaky and completely ignore CCS for WB buffers, which is likely the common case anyway. This would then skip all blitter moves/clears between sys <-> tt and then also means we can drop the extra CCS pages. This should be safe since there is no way to interact with the compression state (potentially uncleared) without using a PAT enabled index (which is rejected at bind), including if trying to be malicious and copy the raw CCS state from userspace, which should give back all zeroes if the src surface (indirect) is lacking compressed PAT index. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Link: https://lore.kernel.org/r/20250516153810.223530-2-matthew.auld@intel.com	2025-05-29 11:54:45 +01:00
Michal Wajdeczko	2cb38bb0ad	drm/xe: Allow to trigger GT resets using debugfs writes Today we allow to trigger GT resets by reading dedicated debugfs files "force_reset" and "force_reset_sync" that we are exposing using drm_info_list[] and drm_debugfs_create_files(). To avoid triggering potentially disruptive actions during otherwise "safe" read operations, expose those two attributes using debugfs function where we can specify file permissions and provide custom "write" handler to trigger the GT resets also from there. This step would allow us to drop triggering GT resets during read operations, which we leave just to give users more time to switch. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250519200914.216-1-michal.wajdeczko@intel.com	2025-05-28 20:13:18 +02:00
Himal Prasad Ghimiray	22eba3be8e	drm/xe/svm: Avoid duplicate eviction on get_pages() failure xe_svm_range_get_pages() already calls drm_gpusvm_range_evict() internally when it fails with -EOPNOTSUPP. Remove the eviction call in the caller to prevent duplicate handling. Fixes: `e0ff0d7cf9` ("drm/xe/svm: Refactor usage of drm_gpusvm* function in xe_svm") Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250526163907.1011529-1-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-28 15:44:34 +05:30
Rodrigo Vivi	39578fa404	drm/xe: Add missing documentation of rpa_freq While at it, already adjust the rpe_freq frequency, to highlight that both are calculated by PCODE at runtime. Fixes: `c6aac2fa77` ("drm/xe: Introduce the RPa information") Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20250521165146.39616-4-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-27 15:25:06 -04:00
Rodrigo Vivi	af53f0fd99	drm/xe: Make xe_gt_freq part of the Documentation The documentation was created with the creation of the component, however it has never been actually shown in the actual Documentation. While doing this, fix the indentation style, to avoid new warnings while building htmldocs. Fixes: `bef52b5c7a` ("drm/xe: Create a xe_gt_freq component for raw management and sysfs") Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250521165146.39616-3-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-05-27 15:25:05 -04:00
Tomasz Lis	20a07782da	drm/xe/vf: Fail migration recovery if fixups needed but platform not supported The post-migration recovery needs to be fully implemented for a specific platform in order to make continuation of workloads possible. New platforms introduce changes which affect the recovery procedure, and without a clear verification of support this leads to errors with no straight forward error message explaining the cause. This patch fixes that issue - it introduces a message to be logged when the current driver is known to not support the current platform. Wedging the driver immediately also decreases the amount of additional errors which would come afterwards if the driver continued operation. v2: Show the message during probe as well as during recovery; do not perform any recovery steps if the recovery is bound to fail v3: Use SRIOV-specific logging, fix typos v4: XE_DEBUG_SRIOV to XE_DEBUG check switch, to make testing more straightforward Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Acked-by: Michał Winiarski <michal.winiarski@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250519230035.3143966-1-tomasz.lis@intel.com	2025-05-22 12:04:09 +02:00
Matt Atwood	49c6dc74b5	drm/xe/ptl: Update the PTL pci id table Update to current bspec table. Bspec: 72574 Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Clint Taylor <Clinton.A.Taylor@intel.com> Link: https://lore.kernel.org/r/20250520195749.371748-1-matthew.s.atwood@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-05-21 13:48:25 -07:00
Shuicheng Lin	d2662cf8f4	drm/xe: Use xe_mmio_read32() to read mtcfg register The mtcfg register is a 32-bit register and should therefore be accessed using xe_mmio_read32(). Other 3 changes per codestyle suggestion: " xe_mmio.c:83: CHECK: Alignment should match open parenthesis xe_mmio.c:131: CHECK: Comparison to NULL could be written "!xe->mmio.regs" xe_mmio.c:315: CHECK: line length of 103 exceeds 100 columns " Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://lore.kernel.org/r/20250513153010.3464767-1-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-05-19 12:18:56 -07:00
Aradhya Bhatia	a7f87deac2	drm/xe: Default auto_link_downgrade status to false xe_pcode_read() can return back successfully without updating the variable 'val'. This can cause an arbitrary value to show up in the sysfs file. Allow the auto_link_downgrade_status to default to 0 to avoid any arbitrary value from coming up. Fixes: `0e414bf7ad` ("drm/xe: Expose PCIe link downgrade attributes") Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com> Link: https://lore.kernel.org/r/20250516124355.4872-1-aradhya.bhatia@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-05-19 09:33:52 -07:00
Aradhya Bhatia	17486cf3df	drm/xe/guc: Make creation of SLPC debugfs files conditional Platforms that do not support SLPC are exempted from the GuC PC support. The GuC PC does not get initialized, and neither do its BOs get created. This causes a problem because the GuC PC debugfs file is still being created. Whenever the file is attempted to read, it causes a NULL pointer dereference on the supposed BO of the GuC PC. So, make the creation of SLPC debugfs files conditional to when SLPC features are supported. Fixes: `aaab5404b1` ("drm/xe: Introduce GuC PC debugfs") Suggested-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com> Link: https://lore.kernel.org/r/20250516141902.5614-1-aradhya.bhatia@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-05-19 09:00:05 -07:00
Tejas Upadhyay	a383cf218e	drm/xe/mocs: Check if all domains awake Check if all domains are awake specially for LNCF regs Fixes: `1182bc74b3` ("drm/xe: Fix MOCS debugfs LNCF readout") Improvements-suggested-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250506142300.1865783-1-tejas.upadhyay@intel.com Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>	2025-05-16 16:51:50 +05:30
Piotr Piórkowski	921ddb37d8	drm/xe/pf: Don't allow LMEM provisioning if LMTT isn't available on the device The LMEM provisioning is applicable only on platforms with LMTT. v2: - new commit description - use xe_gt_assert in xe_gt_sriov_pf_config_set_lmem instead return error, - disable pf_lmem_info if LMTT is not available v3: fix condition in xe_gt_assert v4: rebase Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250513071321.700464-1-piotr.piorkowski@intel.com	2025-05-16 13:11:01 +02:00
John Harrison	16b7e65d29	drm/xe/guc: Track FAST_REQ H2Gs to report where errors came from Most H2G messages are FAST_REQ which means no synchronous response is expected. The messages are sent as fire-and-forget with no tracking. However, errors can still be returned when something goes unexpectedly wrong. That leads to confusion due to not being able to match up the error response to the originating H2G. So add support for tracking the FAST_REQ H2Gs and matching up an error response to its originator. This is only enabled in XE_DEBUG builds given that such errors should never happen in a working system and there is an overhead for the tracking. Further, if XE_DEBUG_GUC is enabled then even more memory and time is used to record the call stack of each H2G and report that with the error. That makes it much easier to work out where a specific H2G came from if there are multiple code paths that can send it. v2: Some re-wording of comments and prints, more consistent use of #if vs stub functions - review feedback from Daniele & Michal). v3: Split config change to separate patch, improve a debug print (review feedback from Michal). v4: Bunch of minor tweaks (review feedback from Michal). Original-i915-code: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512215324.1457009-5-John.C.Harrison@Intel.com	2025-05-15 12:27:37 -07:00
John Harrison	d7d97890e2	drm/xe/guc: Rename CONFIG_XE_LARGE_GUC_BUFFER Rename XE_LARGE_GUC_BUFFER to XE_DEBUG_GUC to allow for more debug only code (in subsequent patch) without adding more config defines that each control only a single thing. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512215324.1457009-4-John.C.Harrison@Intel.com	2025-05-15 12:27:36 -07:00
John Harrison	12373b30e2	drm/xe/guc: Add missing H2G error code definitions These error codes are not actually used in the driver but it is extremely useful to have them available to understand error messages. v2: Add a bunch more error codes and drop 'status' from names (review feedback by Michal W). v3: Drop 'SUCCESS' response as meaningless in current API (review feedback by Michal W). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512215324.1457009-3-John.C.Harrison@Intel.com	2025-05-15 12:27:34 -07:00
John Harrison	fddf8cdd4b	drm/xe/guc: Remove double blank line An earlier patch moved a drm_print a few lines lower but accidentally left a double blank line behind. So fix that. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250512215324.1457009-2-John.C.Harrison@Intel.com	2025-05-15 12:27:33 -07:00
Lucas De Marchi	eaa287069a	drm/xe/guc_submit: Simplify and fix diff calculation With a u32 type, there's no need to check which one is greater: the current is always the latest and if it's less than the previous, it's because it wrapped: just do the unsigned calculation that will lead to the same result, or better the correct one. It fixes an off-by-one in the wrapped calculation, however that doesn't really matter for the timeout calculation. Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250513-time-wrap-v1-1-fba9a69a65c8@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-05-15 06:14:17 -07:00
Michal Wajdeczko	3dbab383e3	drm/xe/guc: Don't allocate managed BO for each policy change We shouldn't use xe_managed_bo_create_from_data() to allocate temporary BO, as it will be released only on unload and every change in wedge_mode policy will consume resources (including precious GGTT). Instead just switchover to GuC buffer cache. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250512220018.172-3-michal.wajdeczko@intel.com	2025-05-15 12:29:55 +02:00
Michal Wajdeczko	b86babc9d9	drm/xe/guc: Unblock GuC buffer cache for all modes Today we were using GuC buffer cache only in the PF mode, but shortly we will want to use it also in native and VF mode. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://lore.kernel.org/r/20250512220018.172-2-michal.wajdeczko@intel.com	2025-05-15 12:29:54 +02:00
Himal Prasad Ghimiray	5aee6e33e1	drm/xe/vm: Add debug prints for SVM range prefetch Introduce debug logs for the prefetch operation of SVM ranges. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-16-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	09ba0a8f06	drm/xe/svm: Implement prefetch support for SVM ranges This commit adds prefetch support for SVM ranges, utilizing the existing ioctl vm_bind functionality to achieve this. v2: rebase v3: - use xa_for_each() instead of manual loop - check range is valid and in preferred location before adding to xarray - Fix naming conventions - Fix return condition as -ENODATA instead of -EAGAIN (Matthew Brost) - Handle sparsely populated cpu vma range (Matthew Brost) v4: - fix end address to find next cpu vma in case of -ENOENT v5: - Move find next vma logic to drm gpusvm layer - Avoid mixing declaration and logic v6: - Use new function names - Move eviction logic to prefetch_ranges v7: - devmem_only assigned 0 - nit address v8: - initialize ctx with 0 Cc: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-15-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	c904d4e2d7	drm/xe/svm: Add xe_svm_find_vma_start() helper Add helper xe_svm_find_vma_start() function to determine start of cpu vma in input range. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-14-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	72fa870957	drm/gpusvm: Introduce drm_gpusvm_find_vma_start() function The drm_gpusvm_find_vma_start() function is used to determine the starting address of a CPU VMA within a specified user range. If the range does not contain any VMA, the function returns ULONG_MAX. v2 - Rename function as drm_gpusvm_find_vma_start() (Matthew Brost) - mmget/mmput v3 - s/mmget/mmget_not_zero/ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-13-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	6275362f18	drm/xe/svm: Add xe_svm_range_validate() and xe_svm_range_migrate_to_smem() The xe_svm_range_validate() function checks if a range is valid and located in the desired memory region. xe_svm_range_migrate_to_smem() checks if range have pages in devmem and migrate them to smem. v2 - Fix function stub in xe_svm.h - Fix doc v3 (Matthew Brost) - Remove extra new line - s/range->base.flags.has_devmem_pages/xe_svm_range_in_vram v4 (Matthew Brost) - s/xe_svm_range_in_vram/range->base.flags.has_devmem_pages - Move eviction logic to separate function Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-12-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	cc795e0410	drm/xe/svm: Make xe_svm_range_needs_migrate_to_vram() public xe_svm_range_needs_migrate_to_vram() determines whether range needs migration to vram or not, modify it to accept region preference parameter too, so we can use it in prefetch too. v2 - add assert instead of warn (Matthew Brost) Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-11-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	e0ff0d7cf9	drm/xe/svm: Refactor usage of drm_gpusvm* function in xe_svm Define xe_svm_range_find_or_insert function wrapping drm_gpusvm_range_find_or_insert for reusing in prefetch. Define xe_svm_range_get_pages function wrapping drm_gpusvm_range_get_pages for reusing in prefetch. -v2 pass pagefault defined drm_gpu_svm context as parameter in xe_svm_range_find_or_insert(Matthew Brost) Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-10-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	da05e5ddc6	drm/xe: Rename lookup_vma function to xe_find_vma_by_addr This update renames the lookup_vma function to xe_vm_find_vma_by_addr and makes it accessible externally. The function, which looks up a VMA by its address within a specified VM, will be utilized in upcoming patches. v2 - Fix doc Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-9-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	bd1d1b46fe	drm/xe/vm: Add an identifier in xe_vma_ops for svm prefetch Add a flag in xe_vma_ops to determine whether it has svm prefetch ops or not. v2: - s/false/0 (Matthew Brost) v3: - s/XE_VMA_OPS_HAS_SVM_PREFETCH/XE_VMA_OPS_FLAG_HAS_SVM_PREFETCH Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-8-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	34ebb62723	drm/xe/vm: Update xe_vma_ops_incr_pt_update_ops to take an increment value Prefetch for SVM ranges can have more than one operation to increment, hence modify the function to accept an increment value as input. v2: - Call xe_vma_ops_incr_pt_update_ops only once for REMAP (Matthew Brost) - Add check for 0 ops v3: - s/u8/int for inc_val and num_remap_ops (Matthew Brost) Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-7-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:54 +05:30
Himal Prasad Ghimiray	da2eb41004	drm/xe/svm: Make xe_svm_range_* end/start/size public These functions will be used in prefetch too, therefore make them public. v2 - Fix kernel doc Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-6-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray	18211ff4d5	drm/xe/svm: Make to_xe_range a public function The to_xe_range function will be used in other files. Therefore, make it public and add kernel-doc documentation Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-5-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray	eb07c2fc10	drm/xe/svm: Helper to add tile masks to svm ranges Introduce a helper to add tile mask of binding present and invalidated for the range. Add a lockdep_assert to ensure it is protected by GPU SVM notifier lock. -v7 rebased Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-4-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray	686a526dad	drm/xe: Make xe_svm_alloc_vram public This function will be used in prefetch too, hence make it public. v2: - Add kernel-doc (Matthew Brost) - Rebase v3: - Move CONFIG_DRM_XE_DEVMEM_MIRROR stub out to xe_svm.c (Matthew Brost) Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-3-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:53 +05:30
Himal Prasad Ghimiray	745df157e4	drm/xe: Introduce xe_vma_op_prefetch_range struct for prefetch of ranges Add xe_vma_op_prefetch_range struct for svm ranges prefetching, including an xarray of SVM range pointers, range count, and target memory region. -v2: Fix doc Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250513040228.470682-2-himal.prasad.ghimiray@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>	2025-05-14 19:25:53 +05:30
Umesh Nerlige Ramappa	82b98cadb0	drm/xe: Add WA BB to capture active context utilization Context Timestamp (CTX_TIMESTAMP) in the LRC accumulates the run ticks of the context, but only gets updated when the context switches out. In order to check how long a context has been active before it switches out, two things are required: (1) Determine if the context is running: To do so, we program the WA BB to set an initial value for CTX_TIMESTAMP in the LRC. The value chosen is 1 since 0 is the initial value when the LRC is initialized. During a query, we just check for this value to determine if the context is active. If the context switched out, it would overwrite this location with the actual CTX_TIMESTAMP MMIO value. Note that WA BB runs as the last part of the context restore, so reusing this LRC location will not clobber anything. (2) Calculate the time that the context has been active for: The CTX_TIMESTAMP ticks only when the context is active. If a context is active, we just use the CTX_TIMESTAMP MMIO as the new value of utilization. While doing so, we need to read the CTX_TIMESTAMP MMIO for the specific engine instance. Since we do not know which instance the context is running on until it is scheduled, we also read the ENGINE_ID MMIO in the WA BB and store it in the PPHSWP. Using the above 2 instructions in a WA BB, capture active context utilization. v2: (Matt Brost) - This breaks TDR, fix it by saving the CTX_TIMESTAMP register "drm/xe: Save CTX_TIMESTAMP mmio value instead of LRC value" - Drop tile from LRC if using gt "drm/xe: Save the gt pointer in LRC and drop the tile" v3: - Remove helpers for bb_per_ctx_ptr (Matt) - Add define for context active value (Matt) - Use 64 bit CTX TIMESTAMP for platforms that support it. For platforms that don't, live with the rare race. (Matt, Lucas) - Convert engine id to hwe and get the MMIO value (Lucas) - Correct commit message on when WA BB runs (Lucas) v4: - s/GRAPHICS_VER(...)/xe->info.has_64bit_timestamp/ (Matt) - Drop support for active utilization on a VF (CI failure) - In xe_lrc_init ensure the lrc value is 0 to begin with (CI regression) v5: - Minor checkpatch fix - Squash into previous commit and make TDR use 32-bit time - Update code comment to match commit msg Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4532 Suggested-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250509161159.2173069-8-umesh.nerlige.ramappa@intel.com	2025-05-12 14:33:25 -07:00
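
The query logic described in commit 82b98cadb0 (the WA BB writes a marker into the LRC's CTX_TIMESTAMP slot, and a context switch overwrites it with the real tick count) can be sketched in plain C as below. This is only an illustration of the decision described in the commit message, not the driver's code: the struct, the function names, and the `CTX_ACTIVE_MARKER` define are hypothetical, and the sketch uses 32-bit values only, ignoring the 64-bit timestamp path and the per-engine MMIO lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; the commit message says the WA BB writes the value 1. */
#define CTX_ACTIVE_MARKER 1u

/* Hypothetical snapshot of the two values a utilization query would read. */
struct ctx_ts_sample {
	uint32_t lrc_ctx_timestamp;  /* value currently stored in the LRC slot */
	uint32_t mmio_ctx_timestamp; /* live CTX_TIMESTAMP MMIO of the engine */
};

/* The context counts as running while the LRC slot still holds the marker. */
static bool ctx_is_active(const struct ctx_ts_sample *s)
{
	assert(s != NULL);
	return s->lrc_ctx_timestamp == CTX_ACTIVE_MARKER;
}

/*
 * Active context: report the live MMIO counter.
 * Switched-out context: the LRC slot already holds the saved tick count.
 */
static uint32_t ctx_run_ticks(const struct ctx_ts_sample *s)
{
	return ctx_is_active(s) ? s->mmio_ctx_timestamp : s->lrc_ctx_timestamp;
}
```

With these assumptions, a sample whose LRC slot still reads 1 yields the MMIO value, while any other LRC value is returned unchanged as the accumulated run ticks.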

1352056 Commits