MegEngine/MegDiffusion

Handle with saving checkpoint failed

ChaiByte opened this issue · 0 comments

If the machine is preemptive, it might be scheduled to be preempted (or encounter other situations that cause the machine to go down). If the checkpoint is being saved at the exact moment, the original data will be corrupted. Therefore, it is reasonable to keep multiple backups locally. Considering the disk space occupancy, it is better to support cloud storage, such as supporting the use of AWS s3.