Failed analyses do not update as "failed" in the UI
ablack3 opened this issue · 8 comments
As soon as it is clear that the R code has failed to run, the datanode UI should update to reflect this (i.e. execution failed). Instead, what I currently see is that analyses fail, which is clear from the docker logs of the datanode container, but the UI still says "executing" for quite a long time.
There are 3 cases here:
- DataNode was not able to reach Execution Engine and submit the analysis. In this case the analysis should be marked as "Failed".
- There is no response from Execution Engine about the status of the analysis for a long time, e.g. because the callback configuration was not set correctly and Execution Engine was not able to send the callback. In this case we need to invalidate the job after some time. E.g., if there is no response from Execution Engine within 1 hour, the job is considered failed. The 1 hour should be parametrised.
- On Data Node restart, we need to invalidate all jobs that are in the Executing state and mark them as "FAILED":
Executing/Aborting => Failed
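The three cases above could be sketched roughly as follows. This is only an illustration under assumptions: `StaleJobInvalidator`, `Job`, `Status`, and all method names are hypothetical and do not correspond to actual DataNode classes, and the timeout is passed in to reflect the request that the 1 hour be parametrised.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Sketch only: every name here is hypothetical, not the DataNode's real code.
class StaleJobInvalidator {

    enum Status { EXECUTING, ABORTING, EXECUTED, FAILED }

    static final class Job {
        final long id;
        Status status;
        Instant lastUpdate; // last time the engine reported on this job

        Job(long id, Status status, Instant lastUpdate) {
            this.id = id;
            this.status = status;
            this.lastUpdate = lastUpdate;
        }
    }

    private final Duration timeout; // the "1 hour" from the issue, made configurable

    StaleJobInvalidator(Duration timeout) {
        this.timeout = timeout;
    }

    static boolean isIncomplete(Job job) {
        return job.status == Status.EXECUTING || job.status == Status.ABORTING;
    }

    // Case 1: the submission never reached the engine, so fail the job right away.
    static void failOnSubmitError(Job job) {
        job.status = Status.FAILED;
    }

    // Case 2: fail incomplete jobs the engine has been silent about for longer than the timeout.
    List<Long> invalidateStale(List<Job> jobs, Instant now) {
        List<Long> failedIds = new ArrayList<>();
        for (Job job : jobs) {
            boolean silentTooLong = Duration.between(job.lastUpdate, now).compareTo(timeout) > 0;
            if (isIncomplete(job) && silentTooLong) {
                job.status = Status.FAILED;
                failedIds.add(job.id);
            }
        }
        return failedIds;
    }

    // Case 3: on restart, fail every job still marked Executing or Aborting.
    static List<Long> invalidateOnRestart(List<Job> jobs) {
        List<Long> failedIds = new ArrayList<>();
        for (Job job : jobs) {
            if (isIncomplete(job)) {
                job.status = Status.FAILED;
                failedIds.add(job.id);
            }
        }
        return failedIds;
    }
}
```

In such a design, `invalidateStale` would presumably be called from a periodic task and `invalidateOnRestart` once at startup. Both return the affected ids so the caller can persist or log the change.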
The following adjustments are to be made:
- On the execution engine side, provide a new endpoint /api/v1/status?id=1&id=2... to query the status of running analyses by id.
- Regularly call this endpoint to update the status of incomplete analyses.
- On the datanode, the endpoint /api/v1/admin/submissions is to feature a new engine field to report engine status as follows:
{
  status: "OK | ERROR | UNKNOWN",
  since: <timestamp when the status was first seen>,
  seenLast: <timestamp when the status was last seen>,
  error: <error message if any>
}
- Frontend is to be updated to show a clear visual indication, together with the error message, when the execution engine status is "ERROR". Whether other statuses are to be displayed is TBD.
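One way the since / seenLast semantics of the engine field might be maintained is sketched below. `EngineStatusTracker` and its methods are hypothetical names chosen for illustration. The sketch assumes that `since` resets only when the status value itself changes, and that `error` is kept only while the status is ERROR. Neither detail is specified in this issue.

```java
import java.time.Instant;

// Sketch only: a hypothetical holder for the proposed "engine" field.
class EngineStatusTracker {

    enum Status { OK, ERROR, UNKNOWN }

    private Status status = Status.UNKNOWN;
    private Instant since;    // when the current status was first seen
    private Instant seenLast; // when the current status was last seen
    private String error;     // error message, only kept while status is ERROR

    // Record the outcome of one engine status check.
    synchronized void update(Status newStatus, String newError, Instant now) {
        if (since == null || newStatus != status) {
            since = now; // status changed (or first observation): restart the clock
        }
        status = newStatus;
        seenLast = now;
        error = (newStatus == Status.ERROR) ? newError : null;
    }

    synchronized Status getStatus() { return status; }
    synchronized Instant getSince() { return since; }
    synchronized Instant getSeenLast() { return seenLast; }
    synchronized String getError() { return error; }
}
```

With this approach, repeated checks that return the same status only advance `seenLast`, so the frontend could show how long the engine has been in its current state.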
I'm testing the latest release. I can see in the docker logs for the Arachne execution engine that my analysis has failed and is no longer running. However, from the Arachne datanode UI it appears as if the analysis is still running, so the user has no idea the code has failed. The timer for the study continues to run as well. Also, the logs in the UI are blank, so there is no way to diagnose the problem without digging into docker logs, which our users might not be able to do.
I suggest we print some messages to the log in the Arachne UI to let the user know what is happening.
I assume this is fixed by commit a5e19eb.
Let's wait for the ARACHNE Datanode release and verify it.
Can we also print a message in the log that indicates that the docker image is being pulled, or was found locally, and that the analysis is starting up? I think there should be a couple of log messages prior to the R code running, just to let us know that the environment is working.
I like the suggestion.
I just checked and all the log messages are in place already
https://github.com/OHDSI/ArachneExecutionEngine/blob/develop/engine/src/main/java/com/odysseusinc/arachne/executionengine/execution/r/DockerService.java#L104
@ablack3 are you getting them?
Closing as initial request was completed. In case of new issues, please raise a separate ticket.