91 Commits

Author SHA1 Message Date
Ce Sun
505dcb8eea drm/amd/ras: Avoid ECC status update in hw_fini for VF unload
VF sends IDH_REQ_GPU_FINI_ACCESS before hw_fini during unload.
PF no longer accepts requests, so skip ECC status update to prevent
mailbox timeout.

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-17 14:49:51 -04:00
Lijo Lazar
4c616e8446 drm/amdgpu: Add host driver reserved-region
Use reserve region helpers for initializing/reserving host driver
reserved region in virtualization environment.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03 13:50:01 -04:00
Lijo Lazar
daaf24d1fc drm/amdgpu: Add fw vram usage reserve-region
Use reserve region helpers for initializing/reserving firmware usage
region in virtualized environments.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03 13:49:58 -04:00
Ce Sun
8d45d88e0b drm/amd/ras: enable uniras via IP version check
enable uniras via IP version check

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03 13:49:23 -04:00
Srinivasan Shanmugam
2ccdefeec7 drm/amd/ras: Add input pointer validation in ras core helpers
Add NULL checks for helper input/output pointers that are directly
dereferenced, such as tm, seqno, dev_info and init_config.

Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-23 14:08:01 -04:00
Srinivasan Shanmugam
e495c688ed drm/amd/ras: Add NULL checks for ras_core sys_fn callbacks
Some ras core helper functions access ras_core and its callback
table (sys_fn) without validating them first.

Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-23 14:06:07 -04:00
Srinivasan Shanmugam
fd490bb9e1 drm/amd/ras: Remove redundant NULL check in pending bad-bank list iteration
ras_umc_log_pending_bad_bank() walks through a list of pending ECC
bad-bank entries. These entries are saved when a bad-bank error cannot
be processed immediately, for example during a GPU reset.

Later, this function iterates over the pending list and retries logging
each bad-bank error. If logging succeeds, the entry is removed from the
list and the memory for that node is freed.

The loop uses list_for_each_entry_safe(), which already guarantees that
ecc_node points to a valid list entry while the loop body is executing.

Checking "ecc_node &&" inside the loop is therefore unnecessary and
redundant.

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_umc.c:225 ras_umc_log_pending_bad_bank() warn: variable dereferenced before check 'ecc_node' (see line 223)

Fixes: 7a3f9c0992 ("drm/amd/ras: Add umc common ras functions")
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-23 14:02:58 -04:00
Srinivasan Shanmugam
2b8101cc3b drm/amd/ras: Fix NULL deref in ras_core_get_utc_second_timestamp()
ras_core_get_utc_second_timestamp() retrieves the current UTC timestamp
(in seconds since the Unix epoch) through a platform-specific RAS system
callback and is used for timestamping RAS error events.

The function checks ras_core in the conditional statement before calling
the sys_fn callback. However, when the condition fails, the function
prints an error message using ras_core->dev.

If ras_core is NULL, this can lead to a potential NULL pointer
dereference when accessing ras_core->dev.

Add an early NULL check for ras_core at the beginning of the function
and return 0 when the pointer is not valid. This prevents the
dereference and makes the control flow clearer.

Fixes: 13c91b5b43 ("drm/amd/ras: Add rascore unified interface function")
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17 17:42:09 -04:00
Srinivasan Shanmugam
6b606216e0 drm/amd/ras: Fix NULL deref in ras_core_ras_interrupt_detected()
Fixes a NULL pointer dereference when ras_core is NULL and ras_core->dev
is accessed in the error path.

Fixes: 13c91b5b43 ("drm/amd/ras: Add rascore unified interface function")
Reported by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17 10:47:17 -04:00
YiPeng Chai
a0b2afa4c3 drm/amd/ras: Pass ras poison consumption message to sriov host
Pass ras poison consumption message to sriov host.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17 10:37:51 -04:00
YiPeng Chai
22664436e6 drm/amd/ras: Add unified interface to handle ras interrupts
Add unified interface to handle ras interrupts, some redundant
interrupt function interfaces will be removed later.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-17 10:33:04 -04:00
Gangliang Xie
a4f87a1c30 drm/amd/ras: adapt sync info func for pmfw eeprom
adapt sync info func for pmfw eeprom

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:43:46 -05:00
Gangliang Xie
e82f9aac39 drm/amd/ras: add check func for pmfw eeprom
add check func for pmfw eeprom

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:43:43 -05:00
Gangliang Xie
89b814a34a drm/amd/ras: add initialization func for pmfw eeprom
add initialization func for pmfw eeprom

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:42:23 -05:00
Gangliang Xie
72289903a2 drm/amd/ras: adapt page retirement process for pmfw eeprom
read bad page data from pmfw eeprom when retirement
is triggered, use timestamp read from eeprom

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:42:20 -05:00
Gangliang Xie
42c46be2ec drm/amd/ras: add read func for pmfw eeprom
add read func for pmfw eeprom, and adapt address converting
for bad pages loaded from pmfw eeprom

v2: change label 'Out' to 'out'

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:42:04 -05:00
Tao Zhou
3972f41bc1 drm/amd/ras: make MCA IPID parse global
add a new IPID parse interface for umc, so we can
implement it for each ASIC, and so we can call it
in other blocks

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:42:00 -05:00
Gangliang Xie
b18475fba3 drm/amd/ras: add append func for pmfw eeprom
add append func for pmfw eeprom

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:41:56 -05:00
Gangliang Xie
5042806557 drm/amd/ras: add check safety watermark func for pmfw eeprom
add check safety watermark func for pmfw eeprom

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-04 11:41:43 -05:00
Gangliang Xie
db217d08af drm/amd/ras: Add table reset func for pmfw eeprom
add table reset func for pmfw eeprom, add smu eeprom control
structure

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02 16:46:58 -05:00
Gangliang Xie
0d21084c02 drm/amd/ras: add wrapper funcs for pmfw eeprom
add wrapper funcs for pmfw eeprom interface to make them
easier to be called

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02 16:43:40 -05:00
Gangliang Xie
12f4d4482a drm/amd/ras: add uniras smu feature flag init func
add flag to indicate if pmfw eeprom is supported or
not, and initialize it

v2: change copyright from 2025 to 2026

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02 16:43:28 -05:00
Gangliang Xie
689b03a0a2 drm/amd/ras: add pmfw eeprom smu interfaces
add smu interfaces and its data structures for
pmfw eeprom in uniras

v2: add 'const' to smu messages array, and specify
    index for each member when initializing.

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02 16:43:25 -05:00
YiPeng Chai
7783a8bf06 drm/amdgpu: Fix static assertion failure issue
Since the PAGE_SIZE is 8KB on sparc64, the size of
structure amdsriov_ras_telemetry will exceed 64KB,
so use absolute value to fix the buffer size.

Fixes the issue:
 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h:522:2: error: static
 assertion failed due to requirement 'sizeof(struct
 amdsriov_ras_telemetry) <= 64 << 10': amdsriov_ras_telemetry must be 64 KB
 |  sizeof(struct amdsriov_ras_telemetry) <=
	AMD_SRIOV_MSG_RAS_TELEMETRY_SIZE_KB_V1 << 10,
 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h:522:40: note:
	expression evaluates to '115616 <= 65536'
 |   sizeof(struct amdsriov_ras_telemetry) <=
	AMD_SRIOV_MSG_RAS_TELEMETRY_SIZE_KB_V1 << 10,

Fixes: cb48a6b2b6 ("drm/amd/ras: use dedicated memory as vf ras command buffer")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602261700.rVOLIw4l-lkp@intel.com/
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02 16:39:27 -05:00
Yujie Liu
74c2689eea drm/amd/ras: fix kernel-doc warning for ras_eeprom_append()
Warning: drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_eeprom.c:845 function parameter 'ras_core' not described in 'ras_eeprom_append'
Warning: drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_eeprom.c:845 expecting prototype for ras_core_eeprom_append(). Prototype was for ras_eeprom_append() instead

Fixes: 5c3be5defc ("drm/amd/ras: Add eeprom ras functions")
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02 16:36:37 -05:00
Kees Cook
9f3d543a9f drm/amd/ras: Fix type size of remainder argument
Forcing an int to be dereferenced at uint64_t for div64_u64_rem() runs
the risk of endian confusion and stack overflowing writes. Seen while
preparing to enable -Warray-bounds globally:

In file included from ../arch/x86/include/asm/processor.h:35,
                 from ../include/linux/sched.h:13,
                 from ../include/linux/ratelimit.h:6,
                 from ../include/linux/dev_printk.h:16,
                 from ../drivers/gpu/drm/amd/amdgpu/../ras/ras_mgr/ras_sys.h:29,
                 from ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras.h:27,
                 from ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c:24:
In function 'div64_u64_rem',
    inlined from 'ras_core_convert_timestamp_to_time' at ../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c:72:9:
../include/linux/math64.h:56:20: error: array subscript 'u64 {aka long long unsigned int}[0]' is partly outside array bounds of 'int[1]' [-Werror=array-bounds=]
   56 |         *remainder = dividend % divisor;
      |         ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c: In function 'ras_core_convert_timestamp_to_time':
../drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_core.c:70:19: note: object 'remaining_seconds' of size 4
   70 |         int days, remaining_seconds;
      |                   ^~~~~~~~~~~~~~~~~

Use a 64-bit type for the remainder calculation, but leave
remaining_seconds as 32-bit to avoid 64-bit division later. The value of
remainder will always be less than seconds_per_day, so there's no
truncation risk.

Fixes: ace232eff5 ("drm/amdgpu: Add ras module files into amdgpu")
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-02 16:35:30 -05:00
Jinzhou Su
edd786e9cf drm/amd/ras: Add function to convert retired address
Add function to convert retired address in SR-IOV
guest.

Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25 16:56:12 -05:00
Jinzhou Su
a1fafed0f5 drm/amd/ras: Handle address check in SR-IOV guest
Handle address check validity command in SR-IOV guest

Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25 16:56:11 -05:00
Jinzhou Su
186eb92a1a drm/amd/ras: Add convert retired address structure
Add convert retired address command and structure
for uniras.

Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25 16:56:11 -05:00
Jinzhou Su
2480d3bf9e drm/amd/ras: Add address check structure
Add address check command and data structure
for uniras.

Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25 16:56:11 -05:00
YiPeng Chai
cb48a6b2b6 drm/amd/ras: use dedicated memory as vf ras command buffer
Use dedicated memory as vf ras command buffer.

V2:
  Add lock to ensure serialization of sending vf ras commands.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Jinzhou Su <jinzhou.su@amd.com>
Tested-by: Jinzhou Su <jinzhou.su@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-25 16:29:09 -05:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Stanley.Yang
c40c94693c drm/amd/ras: statistic xgmi training error count
Report xgmi training error uncorrectable error count.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-05 17:17:48 -05:00
Jinzhou Su
0099f2e92c drm/amd/ras: Replace NPS flags in ras module
Replace AMDGPU_NPS8_PARTITION_MODE with
UMC_MEMORY_PARTITION_MODE_NPS8 to pass sriov
compilation.

Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 17:00:00 -05:00
Jinzhou Su
d3336c935e drm/amd/ras: Support physical address convert
Support physical address convert to current NPS
pages in uniras.

Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-05 16:26:12 -05:00
Candice Li
90254524ee drm/amd/ras: Add vram_type to ras_ta_init_flags
Add vram_type to ras_ta_init_flags.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-10 17:38:13 -05:00
Srinivasan Shanmugam
6ef93f6253 drm/amd/ras: Reduce stack usage in amdgpu_virt_ras_get_cper_records()
amdgpu_virt_ras_get_cper_records() was using a large stack array
of ras_log_info pointers. This contributed to the frame size
warning on this function.

Replace the fixed-size stack array:

    struct ras_log_info *trace[MAX_RECORD_PER_BATCH];

with a heap-allocated array using kcalloc().

We free the trace buffer together with out_buf on all exit paths.
If allocation of trace or out_buf fails, we return a generic RAS
error code.

This reduces stack usage and keeps the runtime behaviour
unchanged.

Fixes:
stack frame size: 1112 bytes (limit: 1024)

Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-10 17:37:50 -05:00
Srinivasan Shanmugam
8b971ce0cb drm/amd/ras: Reduce stack usage in ras_umc_handle_bad_pages()
ras_umc_handle_bad_pages() function used a large local array:
  struct eeprom_umc_record records[MAX_ECC_NUM_PER_RETIREMENT];

Move this array off the stack by allocating it with kcalloc()
and freeing it before return.

This reduces the stack frame size of ras_umc_handle_bad_pages()
and avoids the frame size warning.

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../ras/rascore/ras_umc.c:498:5: warning: stack frame size (1208) exceeds limit (1024) in 'ras_umc_handle_bad_pages' [-Wframe-larger-than]

v2: Removed the duplicate ras_umc_get_new_records() invocation. (Lijo)

Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 14:25:37 -05:00
YiPeng Chai
54c2d9739b drm/amd/ras: Compatible with legacy sriov host
If sriov host is legacy, the guest uniras will
be disabled.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 13:56:34 -05:00
YiPeng Chai
73c6c22694 drm/amd/ras: Add sriov ras preprocessing before gpu reset
Sriov host may clear all VF commands registered to auto
update list during VF reset, set ecc.auto_uUpdate block
to false before VF reset, and after VF reset is complete,
RAS_CMD__GET_ALL_BLOCK_ECC_STATUS command will be re-registered
to auto update list of sriov host.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 13:56:34 -05:00
YiPeng Chai
11dcf72eb5 drm/amd/ras: Support high-frequency querying sriov ras block error count
Support high-frequency querying sriov ras block error count:
1. Create shared memory and fills it with RAS_CMD__GET_LAL_LOC_STATUS
   ras command.
2. The RAS_CMD_GET_ALL_BLOCK_ECC_STATUS command and shared
   memory are registered to sriov host ras auto-update list
   via RAS_CMD_SET_CMD_AUTO_UPDATE command.
3. Once sriov host detects ras error, it will automatically execute
   RAS_CMD__GET_ALL_BLOCK_ECC_STATUS command and write the result to
   shared memory.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 13:56:33 -05:00
YiPeng Chai
fcfa8dbb08 drm/amd/ras: Add ras command to retrieve cper data from sriov host
In order to reduce the number of interactions with sriov
host and the amount of data exchanged, a set of ras commands
is first used to obtain the raw data used to generate cper
from the host, then, guest driver generates cper based
on the obtained raw data.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 13:56:33 -05:00
YiPeng Chai
1855fc6b09 drm/amd/ras: sriov supports handling VF ras commands.
Add basic framework code to sriov to handle
VF ras commands.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 13:56:33 -05:00
YiPeng Chai
127cdd726f drm/amd/ras: ras supports i2c eeprom for mp1 v13_0_12
ras supports i2c eeprom for mp1 v13_0_12.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06 09:57:07 -05:00
YiPeng Chai
d95ca7f515 drm/amdgpu: suspend ras module before gpu reset
During gpu reset, all GPU-related resources are
inaccessible. To avoid affecting ras functionality,
suspend ras module before gpu reset and resume
it after gpu reset is complete.

V2:
  Rename functions to avoid misunderstanding.

V3:
  Move flush_delayed_work to amdgpu_ras_process_pause,
  Move schedule_delayed_work to amdgpu_ras_process_unpause.

V4:
  Rename functions.

V5:
  Move the function to amdgpu_ras.c.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04 11:53:59 -05:00
YiPeng Chai
3f16007d86 drm/amd/ras: Add ras support for umc v12_5_0
Add ras support for umc v12_5_0.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04 11:53:22 -05:00
YiPeng Chai
d7f105a402 drm/amd/ras: Add ras support for nbio v7_9_1
Add ras support for nbio v7_9_1.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04 11:53:22 -05:00
YiPeng Chai
36265d2bcc drm/amd/ras: Increase ras switch control range
Increase ras switch control range.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04 11:53:22 -05:00