Consider fail-safe for checking on terminated models
Opened this issue · 4 comments
In #685 (comment) @kyleam pointed out numerous situations where a model process can be terminated, but not write out either bbi_config.json
or write "Stop Time:" to the .lst
file.
To address these, we could potentially build in a fail-safe where we hash OUTPUT
files every 10 iterations or so, and “consider dead” if a process hasn’t written to it in <N>
iterations. This might be kinda hacky and extreme, but we may want to add it if we’re worried about enough cases where a model stops without writing a bbi_config.json
(which could lead to endlessly checking back, thinking it's still running). Related thoughts:
- Would we consider writing our own
bbi_config.json
(frombbr
) that “pronounced it dead”? I kind of like this idea, but probably needs more thought. - if we did this ^ and then we were wrong (i.e. the model was still running), and then
bbi
tried to write its ownbbi_config.json
later… what would happen? If it just overwrites the previous one, is that actually just ok and reasonable?
My initial thought is that I see the motivation, but I'm worried that "every N iterations" may not be calibrated right to account for complicated models that take more time than expected to move (perhaps not as much of a worry in the context of bootstrap?). I suppose conceptually it's just inherently racy, which makes me nervous, but I know you're saying it might be good enough and do more good than harm in practice.
Anyway, my main comment is that...
fail-safe where we hash
OUTPUT
files every 10 iterations or so, and “consider dead” if a process hasn’t written to it in<N>
iterations
... for this use case (a file that's just being appended to), I think you could just stat the file and look at whether the size has changed (i.e. file.size()
from R). That'll be sufficient for the check and more efficient than hashing the content.
and then
bbi
tried to write its ownbbi_config.json
later… what would happen?
Based on this test, bbi
would just overwrite the file:
bbi nonmem run local 1.ctl &
sleep 0.5
echo alien >1/bbi_config.json
echo 'alien placed; exiting'
Based on this test, bbi would just overwrite the file
And... do we think this is fine? I'm leaning towards yes, because it would basically be
- We thought it was dead, so we marked it dead
- Any checks will not wait for it, and
model_summary()
calls (while it's still running) will fail (which is what they'd do anyway while it's still running)- If we add in some custom handling of our new "it's dead"
bbi_config.json
then any intermediate run ofconfig_log()
or something might give misleading results, but this is still a hypothetical edge case.
- If we add in some custom handling of our new "it's dead"
- The model finishes and overwrite the
bbi_config.json
- Future
model_summary()
, etc. calls will reflect the model finishing successfully - I don't think there will be any evidence left behind that this was once "considered dead", which is probably fine.
Do you agree with that assessment?