drm/xe: fix mem_access for early lrc generation
authorMatthew Auld <matthew.auld@intel.com>
Mon, 27 Nov 2023 09:44:59 +0000 (09:44 +0000)
committerRodrigo Vivi <rodrigo.vivi@intel.com>
Thu, 21 Dec 2023 16:45:07 +0000 (11:45 -0500)
commitfcf98d68c00216b61b034f4d164e5c3074db636a
tree31efb42d8000b1c024d065d862c4559ee4201717
parent5152234e2e7a1d5b0897733f84597df23cde98b1
drm/xe: fix mem_access for early lrc generation

We spawn some hw queues during device probe to generate the default LRC
for every engine type, however the queue destruction step is typically
async. Queue destruction needs to do stuff like GuC context deregister
which requires GuC CT, which in turn requires an active mem_access ref.
The caller during probe is meant to hold the mem_access token, however
due to the async destruction it might have already been dropped if we
are unlucky.

Similar to how we already handle migrate VMs for which there is no
mem_access ref, fix this by keeping the callers token alive, releasing
it only when destroying the queue. We can treat a NULL vm as indication
that we need to grab our own extra ref.

Fixes the following splat sometimes seen during load:

[ 1682.899930] WARNING: CPU: 1 PID: 8642 at drivers/gpu/drm/xe/xe_device.c:537 xe_device_assert_mem_access+0x27/0x30 [xe]
[ 1682.900209] CPU: 1 PID: 8642 Comm: kworker/u24:97 Tainted: G     U  W   E    N 6.6.0-rc3+ #6
[ 1682.900214] Workqueue: submit_wq xe_sched_process_msg_work [xe]
[ 1682.900303] RIP: 0010:xe_device_assert_mem_access+0x27/0x30 [xe]
[ 1682.900388] Code: 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb e8 1e 6c 03 00 48 85 c0 74 06 5b c3 cc cc cc cc 8b 83 28 23 00 00 85 c0 75 f0 <0f> 0b 5b c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 1682.900390] RSP: 0018:ffffc900021cfb68 EFLAGS: 00010246
[ 1682.900394] RAX: 0000000000000000 RBX: ffff8886a96d8000 RCX: 0000000000000000
[ 1682.900396] RDX: 0000000000000001 RSI: ffff8886a6311a00 RDI: ffff8886a96d8000
[ 1682.900398] RBP: ffffc900021cfcc0 R08: 0000000000000001 R09: 0000000000000000
[ 1682.900400] R10: ffffc900021cfcd0 R11: 0000000000000002 R12: 0000000000000004
[ 1682.900402] R13: 0000000000000000 R14: ffff8886a6311990 R15: ffffc900021cfd74
[ 1682.900405] FS:  0000000000000000(0000) GS:ffff888829880000(0000) knlGS:0000000000000000
[ 1682.900407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1682.900409] CR2: 000055f70bad3fb0 CR3: 000000025243a004 CR4: 00000000003706e0
[ 1682.900412] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1682.900413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1682.900415] Call Trace:
[ 1682.900418]  <TASK>
[ 1682.900420]  ? xe_device_assert_mem_access+0x27/0x30 [xe]
[ 1682.900504]  ? __warn+0x85/0x170
[ 1682.900510]  ? xe_device_assert_mem_access+0x27/0x30 [xe]
[ 1682.900596]  ? report_bug+0x171/0x1a0
[ 1682.900604]  ? handle_bug+0x3c/0x80
[ 1682.900608]  ? exc_invalid_op+0x17/0x70
[ 1682.900612]  ? asm_exc_invalid_op+0x1a/0x20
[ 1682.900621]  ? xe_device_assert_mem_access+0x27/0x30 [xe]
[ 1682.900706]  ? xe_device_assert_mem_access+0x12/0x30 [xe]
[ 1682.900790]  guc_ct_send_locked+0xb9/0x1550 [xe]
[ 1682.900882]  ? lock_acquire+0xca/0x2b0
[ 1682.900885]  ? guc_ct_send+0x3c/0x1a0 [xe]
[ 1682.900977]  ? lock_is_held_type+0x9b/0x110
[ 1682.900984]  ? __mutex_lock+0xc0/0xb90
[ 1682.900989]  ? __pfx___drm_printfn_info+0x10/0x10
[ 1682.900999]  guc_ct_send+0x53/0x1a0 [xe]
[ 1682.901090]  ? __lock_acquire+0xf22/0x21b0
[ 1682.901097]  ? process_one_work+0x1a0/0x500
[ 1682.901109]  xe_guc_ct_send+0x19/0x50 [xe]
[ 1682.901202]  set_min_preemption_timeout+0x75/0xa0 [xe]
[ 1682.901294]  disable_scheduling_deregister+0x55/0x250 [xe]
[ 1682.901383]  ? xe_sched_process_msg_work+0x76/0xd0 [xe]
[ 1682.901467]  ? lock_release+0xc9/0x260
[ 1682.901474]  xe_sched_process_msg_work+0x82/0xd0 [xe]
[ 1682.901559]  process_one_work+0x20a/0x500

v2: Add the splat

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
drivers/gpu/drm/xe/xe_exec_queue.c