replicate/replicate-python

How to monitor model training in python?

YouthDream0925 opened this issue · 1 comments

I tried to monitor model training in Python.

def model_training():
    try:
        training = replicate.trainings.create(
            version="stability-ai/sdxl:c221b2b8ef527988fb59bf24a8b97c4561f1c671f73bd389f866bfb27c061316",
            input={
                "input_images": input_images,
            },
            destination=destination
        )

        training.reload()
        print(training.status)
        print("\n".join(training.logs.split("\n")[-10:]))

        return True
    except Exception as e:
        print(str(e))
        return False

But I got an error: 'NoneType' object has no attribute 'split'. What's wrong with my code? I followed the sample here.

How can I monitor the model training process in Python?

Hi @YouthDream0925. You're seeing this error because training.logs is still None at the time you call split on it. In the code you shared, the training is reloaded immediately after creation, so it's unlikely to have started yet, and its logs property hasn't been populated.
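To make the None case concrete, here is a minimal sketch of a helper that tails the logs safely. The name `tail_logs` and its `n` parameter are hypothetical and not part of the replicate library; the only assumption taken from this thread is that `training.logs` is either None or a string.

```python
from typing import Optional


def tail_logs(logs: Optional[str], n: int = 10) -> str:
    """Return the last n lines of logs, or "" when there are no logs yet.

    Hypothetical helper: `logs` stands in for training.logs, which is None
    until the training has produced output.
    """
    if not logs or n <= 0:
        return ""
    return "\n".join(logs.splitlines()[-n:])
```

With this, `print(tail_logs(training.logs))` prints an empty line instead of raising when the training hasn't started yet.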

Here's some example code for how you might monitor the progress of a training job (note that trainings take at least several minutes to complete, and as long as several hours, so a polling interval of less than 30 seconds isn't recommended):

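import time

# Poll until the training reaches a terminal state.
while training.status not in ["succeeded", "failed", "canceled"]:
    if training.logs is not None:
        # Print the last 10 lines of the logs so far.
        print("\n".join(training.logs.split("\n")[-10:]))
    time.sleep(30)
    training.reload()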
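If you'd rather wrap this in a reusable function, something along these lines might work. It's only a sketch: `wait_for_training`, its parameters, and the 6-hour default timeout are hypothetical names and values, not part of the replicate library. It relies only on the `status`, `logs`, and `reload()` members used above, and it assumes the logs only ever grow by appending, so it prints just the text that is new since the last poll.

```python
import time
from typing import Callable

# Terminal statuses as listed in the loop above.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_training(
    training,
    poll_interval: float = 30.0,
    timeout: float = 6 * 60 * 60,  # hypothetical default, adjust as needed
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `training` until it reaches a terminal status and return that status.

    `training` is any object with `status`, `logs`, and `reload()`.
    Assumes logs are append-only; only newly added text is printed.
    """
    printed = 0  # number of log characters already printed
    deadline = time.monotonic() + timeout
    while True:
        logs = training.logs or ""  # logs is None before the training starts
        if len(logs) > printed:
            print(logs[printed:], end="")
            printed = len(logs)
        if training.status in TERMINAL_STATUSES:
            return training.status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"training still {training.status!r} after {timeout} seconds")
        sleep(poll_interval)
        training.reload()
```

You would call it right after `replicate.trainings.create(...)`, for example `status = wait_for_training(training)`, and then check whether `status == "succeeded"`. The `sleep` parameter is there only so the loop can be exercised in tests without actually waiting.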