Inspect sometimes hangs on extremely large runs
MSchmatzAISI commented
Originally discussed on Slack here.
During multi-hour jobs, Inspect will sometimes hang. The user-visible symptom is that the timers stop advancing. Anecdotally, memory and CPU usage do not appear to be high after the hang begins.
We currently do not have a consistent repro case of this behavior that doesn't involve running a real multi-hour job. In addition, when the issue happened, we were not able to diagnose the root cause with basic profiling steps.
Therefore, we'll need to:
- Create a consistent repro of the behavior
  - Ideally we can do this without spending real compute (though that isn't guaranteed if the root cause is complex). We could try building fairly artificial tasks with MockLLM or something similar and running them for a very long time to see whether the bug repros.
  - More sophisticated mock tasks may be required to trigger the bug, such as ones that use Docker sandboxes.
  - If the steps above still don't repro the bug, it might be worth doing an actual run as a last resort to see whether it can be triggered.
- Diagnose the issue once we can reproduce it at least occasionally
  - @JJ Allaire recommends file logging for this so we can observe what Inspect was doing right before it hung.
  - We will likely need to attach debugging tools to the stuck process to inspect what went wrong.
  - We can also look at things like resource usage before the hang, which task it was on, etc., which might make repros easier or give us some idea of the root cause.
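
As a starting point for the diagnosis steps, here is a rough sketch of what the file logging and "attach to the stuck process" pieces could look like using only the Python standard library. This is not an Inspect feature: `install_hang_diagnostics`, the logger name, and the file paths are all hypothetical, and it assumes we can run a few lines of our own code in the process that launches the run.

```python
import faulthandler
import logging
import signal
import threading
import time


def install_hang_diagnostics(log_path, trace_path, interval=60.0):
    """Start a heartbeat file log and register an on-demand stack dump.

    Hypothetical helper, not part of Inspect. Returns (stop_event, trace_file);
    keep both alive for the lifetime of the run.
    """
    logger = logging.getLogger("hang_diagnostics")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep heartbeat lines out of the console output
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)

    # faulthandler writes straight to the file descriptor, so the file must stay open.
    trace_file = open(trace_path, "a")
    if hasattr(signal, "SIGUSR1"):  # SIGUSR1 is not available on Windows
        faulthandler.register(signal.SIGUSR1, file=trace_file, all_threads=True)

    stop_event = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout and True once stop_event is set.
        while not stop_event.wait(interval):
            logger.info(
                "heartbeat threads=%d cpu_seconds=%.1f",
                threading.active_count(),
                time.process_time(),
            )

    threading.Thread(target=heartbeat, name="hang-heartbeat", daemon=True).start()
    return stop_event, trace_file
```

With this installed at the start of a run, `kill -USR1 <pid>` on a stuck process appends the Python stack of every thread to the trace file without stopping the process. The heartbeat log gives a timestamped record of thread count and CPU time leading up to the hang. Note that `faulthandler` only shows thread stacks, not suspended asyncio tasks, so an external tool such as `py-spy dump --pid <pid>` may still be needed.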