HazyResearch/metal

Checkpoint cleanup for simultaneous experiments

jdunnmon opened this issue · 1 comment

Currently, if multiple experiments run in parallel, they all use the same checkpoints directory by default. This is a problem because they all overwrite the same best_model.pth asynchronously, so experiment A can end up loading experiment B's checkpoint.

We should move the checkpoints for a given experiment to its log folder, and then clean them up by default when the run is complete.
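
A rough sketch of what this could look like, using only the standard library. The helper name `experiment_checkpoint_dir`, the `checkpoints` subfolder name, and the `cleanup` flag are all made up for illustration and are not part of MeTaL's API. Plain text files stand in for real model checkpoints.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def experiment_checkpoint_dir(log_dir, cleanup=True):
    """Yield a checkpoint directory private to one experiment's log folder.

    Hypothetical helper for illustration; not part of MeTaL's API.
    """
    checkpoint_dir = Path(log_dir) / "checkpoints"
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    try:
        yield checkpoint_dir
    finally:
        if cleanup:
            # Remove the checkpoints once the run ends (normally or with an error).
            shutil.rmtree(checkpoint_dir, ignore_errors=True)


def demo():
    """Simulate two experiments writing best_model.pth without colliding."""
    root = Path(tempfile.mkdtemp())
    seen = {}
    try:
        with experiment_checkpoint_dir(root / "run_a") as ckpt_a, \
             experiment_checkpoint_dir(root / "run_b") as ckpt_b:
            # Text files stand in for serialized model weights.
            (ckpt_a / "best_model.pth").write_text("weights from A")
            (ckpt_b / "best_model.pth").write_text("weights from B")
            seen["a"] = (ckpt_a / "best_model.pth").read_text()
            seen["b"] = (ckpt_b / "best_model.pth").read_text()
        # Both checkpoint folders are gone after the runs finish.
        seen["cleaned"] = not ckpt_a.exists() and not ckpt_b.exists()
    finally:
        shutil.rmtree(root, ignore_errors=True)
    return seen
```

Because each run derives its checkpoint path from its own log folder, two runs never share a best_model.pth. Passing `cleanup=False` would keep the checkpoints around for anyone who wants to inspect them afterwards.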

Resolved in v0.5, I believe.