Check valid destination of trainings API before executing training

Question

mrb opened this issue a year ago · 11 comments

I ran into an issue where the training succeeded but the job failed, apparently because the destination didn't get transmitted along with the rest of the Python data:

>>> import replicate
>>> training = replicate.trainings.create(
...   destination="mrb/zimm2",
...   version="replicate/flan-t5-xl:7a216605843d87f5426a10d2cc6940485a232336ed04d655ef86b91e020e9210",
...   input={
...     "train_data": "https://gist.githubusercontent.com/mrb/5a704562c19ddb59c9654f55f7cfc382/raw/bf4a0d79e510bddbf552128844626b6f9df70168/zimm.jsonl",
...   },
... )
>>> training.destination
>>> training
Training(id='nee5zglb7i4ragkhzfomng3bsu', completed_at=None, created_at='2023-07-07T18:42:37.915023137Z', destination=None, error=None, input={'train_data': 'https://gist.githubusercontent.com/mrb/5a704562c19ddb59c9654f55f7cfc382/raw/bf4a0d79e510bddbf552128844626b6f9df70168/zimm.jsonl'}, logs='', output=None, started_at=None, status='starting', version=No

Answer 1 · 2023-07-07T20:41:39.000Z

Hi @mrb. Replicate's API validates the destination of a training on creation, so a training shouldn't fail at the end (unless the destination model changes its availability in the meantime). So this looks more like an incorrect representation of state than the training feature itself not working as intended.

The training object returned from replicate.trainings.create represents the state of that training at the time of creation. Since destination is None, either that field wasn't sent by the API or that property wasn't deserialized by the client correctly. I'll look into which of those is causing the problem.
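To see whether the field stays empty once the training has progressed, one option is to fetch the training again and look at the same fields. The sketch below is illustrative only: `summarize_training` and `refetch_summary` are hypothetical helper names, and `client` is assumed to be any object exposing `trainings.get(id)` the way the `replicate` module does.

```python
def summarize_training(training):
    """Pick out the fields discussed in this thread from a training-like object.

    Missing attributes are reported as None instead of raising.
    """
    return {
        "id": getattr(training, "id", None),
        "status": getattr(training, "status", None),
        "destination": getattr(training, "destination", None),
    }


def refetch_summary(client, training_id):
    """Fetch the training again and summarize its current state.

    `client` is an assumption: anything with `trainings.get(training_id)`.
    """
    return summarize_training(client.trainings.get(training_id))


# Possible usage with the real client (needs network access and an API token):
#   import replicate
#   print(refetch_summary(replicate, "nee5zglb7i4ragkhzfomng3bsu"))
```

If `destination` is still None after a re-fetch, that by itself does not say whether the API omitted the field or the client dropped it; comparing against the raw HTTP response would be the next step.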

Answer 2 · 2023-07-08T00:05:29.000Z

@mattt Thanks -- I based the information above on my experience with a specific training (p4dvy53bogm4rljoicrij2q3vy, if you can look up the ID on the backend), where the training was successful but errored at the end, saying something to the effect of "Destination not available".

Answer 3 · 2023-07-08T11:26:28.000Z

Hey @mrb, thanks for providing that reference. I went through our logs and found what happened with your training: the error you saw was "Failed to create trained image after successful training run", which occurred because the API got an authentication failure when attempting to connect to our container registry.

If it's any consolation, the failure occurred after weights were uploaded. So if you do replicate.trainings.get("p4dvy53bogm4rljoicrij2q3vy"), you can download those weights from the URL in the output and use those to create a new image manually.
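A rough sketch of pulling the weights URL out of that output follows. The exact shape of `output` is an assumption here (it may be a plain URL string, a dict, or a list), so the hypothetical helper `collect_urls` just walks whatever it is given and keeps anything that looks like an http(s) URL.

```python
def collect_urls(value):
    """Recursively gather http(s) URL strings from a nested output value."""
    if isinstance(value, str):
        return [value] if value.startswith(("http://", "https://")) else []
    if isinstance(value, dict):
        # Only the values are inspected; keys are treated as labels.
        value = list(value.values())
    if isinstance(value, (list, tuple)):
        urls = []
        for item in value:
            urls.extend(collect_urls(item))
        return urls
    # None, numbers, and other types carry no URLs.
    return []


# Possible usage with the real client (needs network access and an API token):
#   import replicate
#   training = replicate.trainings.get("p4dvy53bogm4rljoicrij2q3vy")
#   print(collect_urls(training.output))
```

Any URL found this way can then be downloaded with an ordinary HTTP client; whether the download needs an authorization header is not covered by this thread.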

I didn't see anything in that stack trace to suggest that the problem had anything to do with the Python client, so there's not much else to do here.

Answer 4 · 2023-08-16T23:49:55.000Z

Hey @mrb, following up on this — please get in touch if you see any more failed trainings. At this point, I'm pretty sure this was a problem with the API rather than the Python client, so I'm going to go ahead and close this issue.

Answer 5 · 2023-12-14T20:12:54.000Z

Hello @mattt
Can you help me out with this issue?
Thanks!

Traceback (most recent call last):
  File "/Users/josephani/Documents/Training/app.py", line 8, in <module>
    training = replicate.trainings.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/replicate/training.py", line 260, in create
    resp = self._client._request(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/replicate/client.py", line 85, in _request
    _raise_for_status(resp)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/replicate/client.py", line 358, in _raise_for_status
    raise ReplicateError(resp.json()["detail"])
replicate.exceptions.ReplicateError: The specified training destination does not exist

Answer 6 · 2023-12-15T11:49:26.000Z

@JosephAni It sounds like the model you specified in the destination doesn't exist. You'll need to create it on the website or with replicate.models.create.
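For anyone who wants to do that check before starting a training, here is a minimal sketch. `split_destination` and `ensure_destination` are hypothetical helpers, not part of the client. It assumes `client.models.get("owner/name")` raises when the model is missing and that `client.models.create` accepts `owner` and `name` keywords; both are worth verifying against the client version you have installed. The exception type to treat as "not found" is passed in explicitly (with the real client, `replicate.exceptions.ReplicateError` from the traceback above is the likely candidate, though it also covers other API failures).

```python
def split_destination(destination):
    """Split "owner/name" into (owner, name), rejecting malformed values."""
    owner, sep, name = destination.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/name', got {destination!r}")
    return owner, name


def ensure_destination(client, destination, error_type, **create_kwargs):
    """Make sure the destination model exists; return True if it was created.

    Assumptions: `client.models.get(destination)` raises `error_type` when the
    model does not exist, and `client.models.create` takes `owner` and `name`
    keywords plus whatever extra settings are passed in `create_kwargs`.
    """
    owner, name = split_destination(destination)
    try:
        client.models.get(destination)
        return False
    except error_type:
        client.models.create(owner=owner, name=name, **create_kwargs)
        return True


# Possible usage with the real client (keyword values are placeholders):
#   import replicate
#   from replicate.exceptions import ReplicateError
#   ensure_destination(replicate, "mrb/zimm2", ReplicateError,
#                      visibility="private", hardware="cpu")
```

Taking the client and the exception type as parameters keeps the helper independent of a particular client version and makes it easy to exercise against a stub.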