The number of Running and Done tasks can exceed the total number of tasks
pooya opened this issue · 3 comments
This happens because we get the information about the "running" and "done" tasks from two different sources. We first consult the disco_server for the "running" tasks and then consult the job event handlers for the "done" tasks. If a task finishes in the small window between these two reads, it is counted both as a running and as a done task, which results in the inconsistency.
One way to avoid this problem is to get the "done" tasks first and the "running" tasks second. In that case, a task that finishes between the two reads is counted in neither group, so it shows up as a "waiting" task, which is more acceptable than a negative waiting count.
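To make the effect of the read order concrete, here is a minimal Python sketch. It is a toy model, not Disco's actual code: `FakeJob`, `finish_one`, `snapshot`, and the `between_reads` hook are all hypothetical names made up for illustration, and the two counters merely stand in for the disco_server and the job event handlers. It also assumes tasks only ever move forward from waiting to running to done.

```python
class FakeJob:
    """Toy stand-in for the two sources of task state (not Disco's real API)."""

    def __init__(self, total, running, done):
        self.total = total
        self.running = running  # what the "disco_server" would report
        self.done = done        # what the "job event handlers" would report

    def finish_one(self):
        # A task completes: it leaves "running" and enters "done".
        self.running -= 1
        self.done += 1


def snapshot(job, done_first, between_reads=lambda: None):
    """Read both counters in the chosen order.

    between_reads runs in the gap between the two reads and simulates
    a task finishing inside the race window.
    """
    if done_first:
        done = job.done
        between_reads()
        running = job.running
    else:
        running = job.running
        between_reads()
        done = job.done
    waiting = job.total - running - done
    return {"running": running, "done": done, "waiting": waiting}


# Reading "running" first: the finishing task is counted twice.
racy = FakeJob(total=10, running=4, done=6)
racy_view = snapshot(racy, done_first=False, between_reads=racy.finish_one)
print(racy_view)  # {'running': 4, 'done': 7, 'waiting': -1}

# Reading "done" first: the finishing task is counted in neither group.
safe = FakeJob(total=10, running=4, done=6)
safe_view = snapshot(safe, done_first=True, between_reads=safe.finish_one)
print(safe_view)  # {'running': 3, 'done': 6, 'waiting': 1}
```

Under the forward-only assumption, reading "done" first can never push running plus done above the total: the "done" value is at most as large as it is at the time of the second read, so the worst case is a task temporarily shown as waiting.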
Your explanation of the cause makes sense, but it seems to suggest that this is only a UI issue. Do you have any idea why, when we see this, the job always seems to hang indefinitely with a negative waiting count? No further progress is made and nothing is actually run for the job once we see the count go negative in the UI.
There was a bug in 0.5.2 with the same symptoms that caused the job to hang. Please upgrade to 0.5.3. If you still have this issue in 0.5.3, then it is a different issue and should be tracked and fixed separately.