replicate/replicate-javascript

Prediction interrupted; please retry (code: PA)

eoinoreilly30 opened this issue · 3 comments

Similar to this issue - replicate/replicate-python#135

I sometimes get the error "Prediction interrupted; please retry (code: PA)" even when passing the input file as a presigned URL (not in the body of the request). The file size is only 10-20 MB, too.

My function version: dec18ae11244f97ff00bdbe9cb7c060cbaf909ca60ac3a07b803783219042854

Failed prediction via API: tjvz6b95x9rge0cjq7ttf7agyr

Reran again via API and it succeeded: ydmnk0kt3hrgc0cjqt2tk7d3q8

My code:

```js
const response = await fetch("https://api.replicate.com/v1/predictions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${REPLICATE_API_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    version:
      "dec18ae11244f97ff00bdbe9cb7c060cbaf909ca60ac3a07b803783219042854",
    input: {
      input_video: greenscreenURL,
    },
  }),
});
```
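Since a manual rerun succeeded, I'm working around this for now with a simple retry wrapper. A rough sketch of what I mean (the function name, poll interval, and retry policy are just illustrative, not anything from the Replicate SDK):

```js
// Illustrative only: create a prediction, poll it to completion, and retry
// when it fails with the "(code: PA)" error. Assumes REPLICATE_API_TOKEN is
// in scope, as in the snippet above.
async function predictWithRetry(input, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const createRes = await fetch("https://api.replicate.com/v1/predictions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${REPLICATE_API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        version:
          "dec18ae11244f97ff00bdbe9cb7c060cbaf909ca60ac3a07b803783219042854",
        input,
      }),
    });
    let prediction = await createRes.json();

    // Poll until the prediction settles (succeeded / failed / canceled).
    while (prediction.status === "starting" || prediction.status === "processing") {
      await new Promise((resolve) => setTimeout(resolve, 2000));
      const pollRes = await fetch(prediction.urls.get, {
        headers: { Authorization: `Bearer ${REPLICATE_API_TOKEN}` },
      });
      prediction = await pollRes.json();
    }

    if (prediction.status === "succeeded") return prediction;

    // Retry only the interrupted-heartbeat error; surface anything else.
    if (!String(prediction.error).includes("(code: PA)") || attempt === maxAttempts) {
      throw new Error(`Prediction failed: ${prediction.error}`);
    }
  }
}
```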

I'm getting this error well into the training process, so I don't think it's file-size related. Would be great if someone from the team could decipher what "code PA" actually means:

[screenshot of the failed training's error message]

This happens to some trainings but not others, so it’s not a persistent error.

The immediate cause of failures with this error is that the instance running the prediction stopped sending the heartbeat that indicates it's still working on it. The root cause varies: networking problems, the underlying compute going away, and so on.

A more common cause is that the process coordinating the work (and sending the heartbeats) gets OOM killed. That's why you might see this more often when handling large files with data URIs, as that can put a lot of pressure on memory.
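To make the memory pressure concrete, here's a rough sketch of what a data-URI input costs compared to a hosted URL (the file path, MIME type, and bucket URL are just examples):

```js
import { readFile } from "node:fs/promises";

// Data URI: the whole file is buffered in memory, copied again as base64
// (~33% larger), and then embedded into the JSON request body.
const bytes = await readFile("input.mp4"); // example path
const dataUri = `data:video/mp4;base64,${bytes.toString("base64")}`;
const heavyBody = JSON.stringify({ input: { input_video: dataUri } });

// Hosted/presigned URL: the request body stays a few hundred bytes and the
// file is fetched server-side instead.
const presignedUrl = "https://example-bucket.s3.amazonaws.com/input.mp4"; // example
const lightBody = JSON.stringify({ input: { input_video: presignedUrl } });
```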

Are you seeing this reliably for a given model and set of inputs? How are you sending input files?

> A more common cause is that the process coordinating the work (and sending the heartbeats) gets OOM killed. That's why you might see this more often when handling large files with data URIs, as that can put a lot of pressure on memory.

Thanks, that might be it. I was sending data URIs from localhost because I didn't want to mess with hosting. But we do have S3 storage in the stage/production environments, so I tweaked the code to send data URIs locally and stored URLs otherwise, which seems to have solved the issue for stage/production.
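For reference, the tweak looks roughly like this (the helper name, env-variable check, and MIME type are just how I happened to wire it up):

```js
import { readFile } from "node:fs/promises";

// Sketch of the switch described above: data URI for local dev, stored S3
// URL everywhere else. Names and the env check are illustrative.
async function resolveVideoInput(localPath, storedUrl) {
  if (process.env.NODE_ENV !== "development") {
    return storedUrl; // hosted URL keeps the request body small
  }
  const bytes = await readFile(localPath);
  return `data:video/mp4;base64,${bytes.toString("base64")}`; // local dev only
}

// usage: input: { input_video: await resolveVideoInput("./clip.mp4", s3Url) }
```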