When running ras uncorrectable error injection and triggering GPU
reset on sGPU, below issue is observed. It's caused by the list
uninitialized when accessing.
[ 80.047227] BUG: unable to handle page fault for address:
ffffffffc0f4f750
[ 80.047300] #PF: supervisor write access in kernel mode
[ 80.047351] #PF: error_code(0x0003) - permissions violation
[ 80.047404] PGD
12c20e067 P4D
12c20e067 PUD
12c210067 PMD
41c4ee067 PTE
404316061
[ 80.047477] Oops: 0003 [#1] SMP PTI
[ 80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G OE 5.4.0-rc7-guchchen #1
[ 80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
[ 80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, false);
/* Build list of devices to query RAS related errors */
- if (hive && adev->gmc.xgmi.num_physical_nodes > 1) {
+ if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
device_list_handle = &hive->device_list;
- } else {
+ else {
+ INIT_LIST_HEAD(&device_list);
list_add_tail(&adev->gmc.xgmi.head, &device_list);
device_list_handle = &device_list;
}