habanalabs: there is no kernel TDR in future ASICs
author Oded Gabbay <ogabbay@kernel.org>
Thu, 13 Jan 2022 08:05:38 +0000 (10:05 +0200)
committer Oded Gabbay <ogabbay@kernel.org>
Mon, 28 Feb 2022 12:22:02 +0000 (14:22 +0200)
In future ASICs, there is no kernel TDR for new workloads that are
submitted directly from user-space to the device.

Therefore, the driver can NEVER know that a workload has timed out.

So, when the user asks us to wait for an interrupt on the workload's
completion, and the wait has timed out, it doesn't mean the workload
has timed out. It only means the wait has timed out, which is NOT an
error from the driver's perspective.
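
The resulting mapping from wait outcome to return code and status can be
sketched as below. This is a simplified illustration, not the driver code:
the enum names and the map_wait_outcome() helper are hypothetical stand-ins,
and only the -EIO/aborted and 0/busy pairings are taken from the patch.

```c
#include <errno.h>

/* Hypothetical stand-ins for the driver's wait status values. */
enum wait_status {
	WAIT_STATUS_COMPLETED,
	WAIT_STATUS_BUSY,
	WAIT_STATUS_ABORTED,
};

/* Hypothetical outcomes of the kernel-side wait on the interrupt. */
enum wait_outcome {
	WAIT_SIGNALED,	/* the interrupt fired before the timeout */
	WAIT_ABORTED,	/* the wait was aborted */
	WAIT_TIMED_OUT,	/* the timeout expired first */
};

/*
 * Map the outcome of the wait to a return code and a status.
 * A timed-out wait is reported as "busy" with rc == 0, because the
 * driver cannot tell whether the workload itself has timed out.
 */
static int map_wait_outcome(enum wait_outcome outcome,
			    enum wait_status *status)
{
	switch (outcome) {
	case WAIT_SIGNALED:
		*status = WAIT_STATUS_COMPLETED;
		return 0;
	case WAIT_ABORTED:
		*status = WAIT_STATUS_ABORTED;
		return -EIO;
	case WAIT_TIMED_OUT:
	default:
		*status = WAIT_STATUS_BUSY;
		return 0;
	}
}
```

With this mapping, a caller has to inspect the status, not only the return
code, to distinguish a completed workload from one that is still running.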

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
drivers/misc/habanalabs/common/command_submission.c

index 2f40b93..29e0549 100644
@@ -2932,11 +2932,14 @@ static int _hl_interrupt_wait_ioctl(struct hl_device *hdev, struct hl_ctx *ctx,
                                rc = -EIO;
                                *status = HL_WAIT_CS_STATUS_ABORTED;
                        } else {
-                               dev_err_ratelimited(hdev->dev, "Waiting for interrupt ID %d timedout\n",
-                                               interrupt->interrupt_id);
-                               rc = -ETIMEDOUT;
+                               /* The wait has timed-out. We don't know anything beyond that
+                                * because the workload wasn't submitted through the driver.
+                                * Therefore, from driver's perspective, the workload is still
+                                * executing.
+                                */
+                               rc = 0;
+                               *status = HL_WAIT_CS_STATUS_BUSY;
                        }
-                       *status = HL_WAIT_CS_STATUS_BUSY;
                }
        }
 
@@ -3049,6 +3052,12 @@ wait_again:
                        interrupt->interrupt_id);
                rc = -EINTR;
        } else {
+               /* The wait has timed-out. We don't know anything beyond that
+                * because the workload wasn't submitted through the driver.
+                * Therefore, from driver's perspective, the workload is still
+                * executing.
+                */
+               rc = 0;
                *status = HL_WAIT_CS_STATUS_BUSY;
        }