fourTheorem/podwhisperer

`SageMaker CreateTransformJob` Errors Out

r-token opened this issue · 11 comments

Hi there! Love the AWS Bites podcast. Thank you for open-sourcing this work.

Currently, the SageMaker CreateTransformJob is failing for me. Here's what my state machine graph looks like:
[Screenshot of the state machine graph]

You'll notice that the HeadObject Whisper Output step fails as well. I assume these two issues are related. I've detailed both for you below.

  1. The HeadObject Whisper Output error is an S3.NoSuchKeyException, and I can confirm that there is no whisper-batch-output/file.json object in the specified bucket. How is that supposed to get written? Here's the full error:
{
  "resourceType": "aws-sdk:s3",
  "resource": "headObject",
  "error": "S3.NoSuchKeyException",
  "cause": "null (Service: S3, Status Code: 404, Request ID: PCFHSWJ9ZXYPKDGS, Extended Request ID: 77ZfX0HJkW9HKcRD7/ivT6PJg38uwVNxriDd661AkjgDxRyZH4qZdJpvnFjAT7xLOSiuhhJBLlc=) (Service: S3, Status Code: 404, Request ID: PCFHSWJ9ZXYPKDGS)"
}
  2. The SageMaker CreateTransformJob error message in CloudWatch simply says exec /app/serve: exec format error. There are a few ideas online as to what might cause this, but I'm out of my wheelhouse here. And for all I know, this could be caused by the HeadObject Whisper Output error listed above.

Any idea on what could be going wrong? Happy to provide any additional information that might be helpful. I did take what you have here and adjust it to work with the Serverless Framework instead of AWS SAM - so something may have simply gotten lost in translation.

Thanks again for publishing this!

@r-token From the exec format error, the immediate suggestion would be to check that the architecture of the container image pushed to ECR is amd64 and not arm64. This is a problem you can easily encounter on Mac M1/M2 if you don't explicitly set the image architecture when you build.
Having said that, the Dockerfile specifies the target platform so this should be automatic for you.

Can you run

docker image inspect <IMAGE>

and check the value of the Architecture property?
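If it's easier, you can pull out just that field with docker's --format template syntax:

docker image inspect <IMAGE> --format '{{.Architecture}}'

which prints amd64 or arm64 directly.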

You can then explicitly set the architecture when you build:

docker build --platform linux/amd64 .....

Please let us know how you get on so we can update the code/docs as needed.

That was it @eoinsha! I did see arm64 in the Architecture property after running that docker image inspect command. Adding --platform linux/amd64 to the docker build command fixed the exec format error. Thank you!

I can tell the SageMaker job is actually running now, but it still fails. CloudWatch now shows Exception in worker process and SystemError: initialization of _internal failed without raising an exception

The traceback shows the following:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gunicorn/arbiter.py", line 589, in spawn_worker
    worker.init_process()

As well as Shutting down: Master and Reason: Worker failed to boot. Any insight here?

I haven't seen that error and nothing obvious springs to mind.
Have you changed the SageMaker instance size or anything? Could it be running out of memory?

No, I haven't changed anything about the SageMaker instance. Interesting. I'll keep looking into it.

I did see an error when running the docker build command that said ERROR: numba 0.56.4 has requirement numpy<1.24,>=1.18, but you'll have numpy 1.24.2 which is incompatible. I wonder if that has anything to do with it.

I ended up needing to do the following:

  1. Add numpy==1.23.5 to the requirements.txt file
  2. Add the --preload flag to the gunicorn command launched via subprocess.Popen in the serve file (this enables more robust CloudWatch logs - see the sketch below)
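
For reference, here's roughly what that gunicorn launch looks like with the flag added. This is only a sketch: the worker count, timeout, bind address, and wsgi:app entry point are assumptions and may differ from the actual serve script.

import subprocess

# Sketch of the modified launch in serve; the values below are illustrative
gunicorn = subprocess.Popen([
    'gunicorn',
    '--preload',            # load the app in the master process so startup errors surface in CloudWatch
    '--workers', '1',
    '--timeout', '3600',
    '--bind', 'unix:/tmp/gunicorn.sock',
    'wsgi:app',
])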

Those steps resulted in the Whisper transcription working for a given audio file (woo!), but Step Functions doesn't seem to understand when that SageMaker transform job is done. It times out on the SageMaker CreateTransformJob step despite the transcription working and uploading the .json.out file to S3.

Do you know why the Batch Transform Job might not be communicating to Step Functions that it finished the job successfully? It just times out instead. I tried adding the states:SendTaskSuccess IAM permission to the WhisperExecutionRole, but no luck.
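
For reference, I believe the state uses the createTransformJob.sync service integration, roughly like the sketch below - the state names, S3 paths, instance type, and timeout are illustrative, and my Serverless Framework port may differ from the original SAM template.

"Whisper CreateTransformJob": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
  "Parameters": {
    "TransformJobName.$": "$.jobName",
    "ModelName": "whisper-model",
    "TransformInput": {
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "S3Prefix",
          "S3Uri": "s3://my-bucket/whisper-batch-input/"
        }
      }
    },
    "TransformOutput": { "S3OutputPath": "s3://my-bucket/whisper-batch-output/" },
    "TransformResources": { "InstanceType": "ml.m4.xlarge", "InstanceCount": 1 }
  },
  "TimeoutSeconds": 3600,
  "Next": "HeadObject Whisper Output"
}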

The last thing that shows in the SageMaker TransformJobs CloudWatch logs is: "POST /invocations HTTP/1.1" 200 1007 "-" "Go-http-client/1.1". I'm assuming the gunicorn subprocess is not returning properly, as nothing appears in the logs after that - but I can't figure out why it isn't returning.

Update: it looks like it's stuck inside the PID while loop:

pids = set([nginx.pid, gunicorn.pid])
while True:
    print('in the pid while loop')
    # os.wait() blocks until any child process exits; the loop only ends
    # once one of the processes we care about (nginx or gunicorn) terminates
    pid, _ = os.wait()
    if pid in pids:
        break

print('Past the pid while loop, calling sigterm_handler')

I can see "in the pid while loop" printed repeatedly. However, the print('Past the pid while loop, calling sigterm_handler') command never fires. Any idea why this might be happening and how I can get past it?

If I remove that while loop manually and force it to move on, it then gets stuck inside the sigterm_handler function and never prints Inference server exiting. I can see it does kill the nginx_pid and gunicorn_pid successfully and gets to the sys.exit(0) call. Still unsure why it isn't moving on after that. Any insight would be much appreciated @eoinsha

Ok, never mind! It was not stuck in that loop - I blame my lack of Python knowledge for not understanding that better. It simply takes a long time for Whisper to transcribe the audio, at least on the instance types I've tried (ml.m4.xlarge and ml.p3.2xlarge) - both take about 8 minutes to transcribe a 30-second audio file.

I think we can close this issue now @eoinsha. However, I am surprised it's taking 8 minutes to complete the transcription for a 30-second audio clip. Is that expected? I am planning to run it on an ml.p3.16xlarge to see the difference, but I first need AWS to increase the quota for that instance type on my account.

Regardless, the core changes necessary for me were:

  1. Add the --platform linux/amd64 flag to the docker build command
  2. Add numpy==1.23.5 to the requirements.txt file
  3. Add the --preload flag to the gunicorn command to enable more robust CloudWatch logs

After further investigation, I've found that the speed bottleneck is not the actual Whisper transcription. The transcription itself takes only about 3 seconds for a 30-second audio clip, so that's great.

The bottleneck seems to be how long it takes the SageMaker Batch Transform Job to load in the 5 GB Docker image from ECR. That is what I think is taking about 8 minutes. Once it's loaded in, the job finishes quickly. Any advice on how to speed that up? Can SageMaker Batch Transform Jobs use a "warm" instance that always has that image loaded? Or is there another way to slim down the Docker image so it's quicker to load?

I have a pretty similar bug: my SageMaker CreateTransformJob keeps loading, and after 30 minutes I stopped the job. My HeadObject Whisper Output step is also in the Caught error state. There is nothing in the SageMaker CreateTransformJob logs - the log is empty.

@r-token You're correct that the transcription time is not the bottleneck here. There is a significant overhead in startup, but I didn't observe that this was all down to ECR load. How did you measure this?
My assumption was that it was also down to SageMaker's scheduler overhead.

We typically use it for 30-60 minute audio, and the job duration is usually in the 30-minute range. For longer episodes, I have had to increase the instance size to avoid it timing out. @timojDE - what audio duration are you using?

I tried it with the sample audio files, so about 10 seconds.

Edit:
It works now, but the next problem I have is that once the transcription is done, it still takes forever for the result to show up. In the SageMaker log it is already at 100%, and then nothing happens for an eternity.