facebookresearch/localrf

[Question] Resume training from a checkpoint

sevashasla opened this issue · 2 comments

Hello! Thank you for the great work.
I had been training a model for approximately 20 hours when an error occurred on my computer, causing the training to stop. Is there a way to resume training from a checkpoint? I saw the line TODO: Add midpoint loading and the commented code after it. I could try to implement it myself; could you please point out the potential problems?

Hi,
I will not have time to look into it this week.
In addition to loading checkpoints (see here), we need to handle the training dataloader train_dataset properly so that it provides the images currently being trained on. I would use local_tensorfs.blending_weights[:, -1] > 0 to determine which frames should be activated / deactivated in the training dataset.
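
For reference, here is a minimal sketch of what resuming could look like. It assumes the checkpoint is a plain dict with "model", "optimizer", and "iteration" entries, and `activate_frames` is a hypothetical helper on the dataset; only `local_tensorfs`, `train_dataset`, and the blending-weights mask come from the comment above, the rest is illustrative.

```python
import torch

def resume_training(ckpt_path, local_tensorfs, optimizer, train_dataset, device="cuda"):
    # Assumed checkpoint layout: {"model": ..., "optimizer": ..., "iteration": ...}.
    ckpt = torch.load(ckpt_path, map_location=device)

    # Restore model and optimizer state saved at the checkpoint.
    local_tensorfs.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])

    # A frame is considered active if its blending weight for the most
    # recent local radiance field is positive.
    active_mask = local_tensorfs.blending_weights[:, -1] > 0  # shape: [num_frames]
    active_frame_ids = torch.nonzero(active_mask, as_tuple=False).squeeze(-1).tolist()

    # Placeholder for whatever mechanism train_dataset uses to select
    # which frames it serves during training.
    train_dataset.activate_frames(active_frame_ids)

    # Continue the training loop from the saved iteration.
    return ckpt.get("iteration", 0)
```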

Thank you for your fast answer!