metrumresearchgroup/bbr

Consider fail-safe for checking on terminated models

Opened this issue · 4 comments

In #685 (comment) @kyleam pointed out numerous situations where a model process can be terminated, but not write out either bbi_config.json or write "Stop Time:" to the .lst file.

To address these, we could potentially build in a fail-safe where we hash OUTPUT files every 10 iterations or so, and “consider dead” if a process hasn’t written to it in <N> iterations. This might be kinda hacky and extreme, but we may want to add it if we’re worried about enough cases where a model stops without writing a bbi_config.json (which could lead to endlessly checking back, thinking it's still running). Related thoughts:

  • Would we consider writing our own bbi_config.json (from bbr) that “pronounced it dead”? I kind of like this idea, but probably needs more thought.
  • if we did this ^ and then we were wrong (i.e. the model was still running), and then bbi tried to write its own bbi_config.json later… what would happen? If it just overwrites the previous one, is that actually just ok and reasonable?

My initial thought is that I see the motivation, but I'm worried that "every N iterations" may not be calibrated right to account for complicated models that take more time than expected to move (perhaps not as much of a worry in the context of bootstrap?). I suppose conceptually it's just inherently racy, which makes me nervous, but I know you're saying it might be good enough and do more good than harm in practice.

Anyway, my main comment is that...

fail-safe where we hash OUTPUT files every 10 iterations or so, and “consider dead” if a process hasn’t written to it in <N> iterations

... for this use case (a file that's just being appended to), I think you could just stat the file and look at whether the size has changed (i.e. file.size() from R). That'll be sufficient for the check and more efficient than hashing the content.

and then bbi tried to write its own bbi_config.json later… what would happen?

Based on this test, bbi would just overwrite the file:

bbi nonmem run local 1.ctl &
sleep 0.5
echo alien >1/bbi_config.json
echo 'alien placed; exiting'

Based on this test, bbi would just overwrite the file

And... do we think this is fine? I'm leaning towards yes, because it would basically be

  • We thought it was dead, so we marked it dead
  • Any checks will not wait for it, and model_summary() calls (while it's still running) will fail (which is what they'd do anyway while it's still running)
    • If we add in some custom handling of our new "it's dead" bbi_config.json then any intermediate run of config_log() or something might give misleading results, but this is still a hypothetical edge case.
  • The model finishes and overwrite the bbi_config.json
  • Future model_summary(), etc. calls will reflect the model finishing successfully
  • I don't think there will be any evidence left behind that this was once "considered dead", which is probably fine.

Do you agree with that assessment?

@seth127 Yes, that sounds reasonable to me.