automl/HpBandSter

Saving the state?

netheril96 opened this issue · 1 comments

I am surveying different packages for hyperparameter optimization, and HpBandSter seems promising, especially because of its support for distributed training. But one thing I haven't figured out is how the master handles interruptions. Typically training a model takes a long time, so the master must stay alive even longer (it has to outlive all workers combined). What happens when the master crashes or is preempted?

Then the whole optimization run will crash. You will be able to resume it if you logged the intermediate results. Resuming here means that the master can rebuild the same model as before, but jobs that were still running on workers at the time of the crash will not be recovered.
If you are asking because you want to run everything on a cluster with a fairly strict time limit on jobs, I recommend running the master either on the login node or on some other machine that is reachable from the compute nodes.
Usually, the master doesn't crash. We had runs lasting several days, up to two weeks I think, without any major problems.
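For what it's worth, the log-and-replay pattern described above can be sketched without any HpBandSter-specific calls. This is a minimal, illustrative simulation, not the library's actual API: each finished run is appended to a JSON-lines file, and after a master crash the file is replayed to rebuild the master's view of completed runs (the file name and record fields here are made up for the example):

```python
import json
import os

def log_result(path, config_id, loss):
    # Append one finished run per line, so the log survives a master crash
    # and partial writes only ever lose the last record.
    with open(path, "a") as f:
        f.write(json.dumps({"config_id": config_id, "loss": loss}) + "\n")

def rebuild_state(path):
    # Replay the log to reconstruct which runs finished and with what loss.
    # Runs that were in flight on workers at crash time were never logged,
    # so (as noted above) they are simply not recovered.
    if not os.path.exists(path):
        return {}
    state = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            state[rec["config_id"]] = rec["loss"]
    return state
```

On restart, the master would call `rebuild_state` before accepting new work, seed its model with the recovered results, and re-dispatch anything missing.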