victorchall/EveryDream-trainer

Running but no checkpoints saved

celticsha opened this issue · 4 comments

Sometimes this runs beautifully on my PC. Other times it seems to run but doesn't save anything to the checkpoints directory.

By default, the provided YAML configs save a checkpoint at the end of every epoch. Are you getting to the end of at least one epoch? That should be when the main steps progress bar fills.
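To confirm whether anything was actually written, a quick scan for checkpoint files can help. This is only a sketch: `list_checkpoints` is a hypothetical helper, the `logs` folder name is a placeholder for wherever your run writes its output, and it assumes checkpoints use the `.ckpt` extension.

```python
from pathlib import Path


def list_checkpoints(root):
    """Return every *.ckpt file found under root, oldest first.

    Hypothetical helper, not part of the trainer. Returns an empty
    list if the folder does not exist.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    ckpts = [p for p in root_path.rglob("*.ckpt") if p.is_file()]
    # Sort by modification time so the newest checkpoint comes last.
    return sorted(ckpts, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # "logs" is a placeholder; point this at your own output folder.
    for path in list_checkpoints("logs"):
        size_gb = path.stat().st_size / 1024**3
        print(f"{path}  ({size_gb:.2f} GB)")
```

If this prints nothing after the steps bar has filled at least once, that is worth mentioning in your report along with the log.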

An epoch may take a while if your repeats value and number of training images are large. You can get checkpoints more frequently by reducing repeats in the config YAML. Details are in the README.
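As a rough way to see how repeats affects the wait for the first checkpoint, here is a back-of-the-envelope sketch. It assumes each image is seen `repeats` times per epoch and that one step consumes one batch; that is my reading of how repeats works, not something confirmed from the trainer code, and the function name and example numbers are made up for illustration.

```python
import math


def steps_per_epoch(num_images, repeats, batch_size):
    """Estimate the steps in one epoch.

    Assumption: every image is seen `repeats` times per epoch and each
    step processes `batch_size` images. Check the README for the
    trainer's actual behavior.
    """
    return math.ceil(num_images * repeats / batch_size)


# Illustrative numbers only: 1000 images at batch size 4.
print(steps_per_epoch(1000, 20, 4))  # 5000
print(steps_per_epoch(1000, 5, 4))   # 1250
```

Under those assumptions, cutting repeats from 20 to 5 brings the first end-of-epoch checkpoint four times sooner.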

The code pushed in the last 12 hours lowers system memory use. It should help quite a bit, because saving a CKPT takes a lot of system memory, whether physical RAM or virtual memory/pagefile. Hopefully this will get rid of quite a few of the problems people have on RunPod when training on thousands of images.

Open a new issue if you still have problems on the latest code. A full dump of your log will be helpful.