Check valid destination of trainings API before executing training
mrb opened this issue · 11 comments
Ran into an issue where the training was successful but the job failed, because the status didn't get transmitted along with the Python data for some reason:
>>> training.destination
>>> training = replicate.trainings.create(
... destination="mrb/zimm2",
... version="replicate/flan-t5-xl:7a216605843d87f5426a10d2cc6940485a232336ed04d655ef86b91e020e9210",
... input={
... "train_data": "https://gist.githubusercontent.com/mrb/5a704562c19ddb59c9654f55f7cfc382/raw/bf4a0d79e510bddbf552128844626b6f9df70168/zimm.jsonl",
... },
... )
>>> training
Training(id='nee5zglb7i4ragkhzfomng3bsu', completed_at=None, created_at='2023-07-07T18:42:37.915023137Z', destination=None, error=None, input={'train_data': 'https://gist.githubusercontent.com/mrb/5a704562c19ddb59c9654f55f7cfc382/raw/bf4a0d79e510bddbf552128844626b6f9df70168/zimm.jsonl'}, logs='', output=None, started_at=None, status='starting', version=No
Hi @mrb. Replicate's API validates the destination of trainings on creation, so a training shouldn't fail at the end (unless the destination model changes its availability). So this looks more like an incorrect representation of state than the training feature itself not working as intended.
The training object returned from replicate.trainings.create represents the state of that training at the time of creation. Since destination is None, either that field wasn't sent by the API or that property wasn't deserialized by the client correctly. I'll look into which of those is causing the problem.
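Since the object from create is only a snapshot, one way to see later state is to fetch the training again by id. Below is a minimal sketch of that idea, not the client's own implementation: wait_for_training is a hypothetical helper, it takes the fetch function as an argument (for example replicate.trainings.get), and the set of terminal statuses and the polling interval are assumptions.

```python
import time

# Statuses assumed to mean the training will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_training(get_training, training_id, *, poll_seconds=10.0,
                      max_polls=360, sleep=time.sleep):
    """Re-fetch a training until it reaches a terminal status.

    `get_training` is any callable that takes an id and returns an object
    with a `status` attribute (hypothetically, replicate.trainings.get).
    """
    for _ in range(max_polls):
        training = get_training(training_id)  # fresh snapshot on every poll
        if training.status in TERMINAL_STATUSES:
            return training
        sleep(poll_seconds)
    raise TimeoutError(f"training {training_id} did not finish in time")
```

Called as wait_for_training(replicate.trainings.get, training.id), this would return the latest snapshot, whose fields can then be compared with the one returned at creation.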
@mattt Thanks -- I based the information above on my experience with a specific training (p4dvy53bogm4rljoicrij2q3vy if you can look up the id on the backend), where the training was successful, but at the end, it errored, saying something to the effect of "Destination not available".
Hey @mrb, thanks for providing that reference. I went through our logs and found what happened with your training: The error you saw was "Failed to create trained image after successful training run", which occurred because the API got an authentication failure to our package registry when attempting to connect to our container registry.
If it's any consolation, the failure occurred after weights were uploaded. So if you do replicate.trainings.get("p4dvy53bogm4rljoicrij2q3vy"), you can download those weights from the URL in the output and use those to create a new image manually.
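For anyone in the same spot, here is a rough sketch of pulling the weights out of a training's output using only the standard library. Both helpers are hypothetical, and the shape of output is an assumption: the sketch accepts either a bare URL string or a dict with a "weights" key, so check what your training actually returns.

```python
from __future__ import annotations

import urllib.request


def weights_url(output) -> str | None:
    """Return the weights URL from a training's output, if one is present."""
    if isinstance(output, str) and output.startswith(("http://", "https://")):
        return output
    if isinstance(output, dict):
        value = output.get("weights")  # assumed key name, not confirmed above
        if isinstance(value, str):
            return value
    return None


def download_weights(output, path: str) -> str:
    """Download the weights referenced by `output` to `path`."""
    url = weights_url(output)
    if url is None:
        raise ValueError("no weights URL found in training output")
    urllib.request.urlretrieve(url, path)  # performs the network request
    return path
```

Something like download_weights(replicate.trainings.get("p4dvy53bogm4rljoicrij2q3vy").output, "weights.tar") would then save the file locally; the file name here is only an example.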
I didn't see anything in that stack trace to suggest that the problem had anything to do with the Python client, so there's not much else to do here.
Hey @mrb, following up on this — please get in touch if you see any more failed trainings. At this point, I'm pretty sure this was a problem with the API rather than the Python client, so I'm going to go ahead and close this issue.
Hello @mattt
Can you help me out with this issue?
Thanks!
Traceback (most recent call last):
  File "/Users/josephani/Documents/Training/app.py", line 8, in <module>
    training = replicate.trainings.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/replicate/training.py", line 260, in create
    resp = self._client._request(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/replicate/client.py", line 85, in _request
    _raise_for_status(resp)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/replicate/client.py", line 358, in _raise_for_status
    raise ReplicateError(resp.json()["detail"])
replicate.exceptions.ReplicateError: The specified training destination does not exist
@JosephAni It sounds like the model you specified in the destination doesn't exist. You'll need to create it on the website or with replicate.models.create.
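A small sketch of checking for the destination model before starting a training, and creating it if the lookup fails. Treat this as an illustration under assumptions: ensure_destination and parse_destination are hypothetical helpers, models is expected to behave like replicate.models (a get taking "owner/name" and a create taking keyword arguments), the visibility and hardware defaults are placeholders, and not_found should be set to whatever exception your client version raises for a missing model.

```python
from __future__ import annotations


def parse_destination(destination: str) -> tuple[str, str]:
    """Split an "owner/name" destination string into its two parts."""
    owner, sep, name = destination.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/name', got {destination!r}")
    return owner, name


def ensure_destination(models, destination, *, visibility="private",
                       hardware="cpu", not_found=Exception):
    """Return the destination model, creating it if the lookup fails.

    `models` is assumed to expose get(...) and create(...) in the style of
    replicate.models; the default visibility and hardware are placeholders.
    """
    owner, name = parse_destination(destination)
    try:
        return models.get(f"{owner}/{name}")
    except not_found:
        # The lookup failed, so create the model the training will push to.
        return models.create(owner=owner, name=name,
                             visibility=visibility, hardware=hardware)
```

With the real client this might be called as ensure_destination(replicate.models, "mrb/zimm2", not_found=replicate.exceptions.ReplicateError) right before replicate.trainings.create.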