A completely automated podcast audio transcription workflow with super accurate results!
Note: this project was presented on the AWS Bites podcast. Check out the full episode!
This project uses:
- OpenAI Whisper for super accurate transcription
- Amazon Transcribe to add speaker identification
- FFmpeg for audio transcoding to MP3
- AWS Lambda for:
  - Merging the Whisper and Transcribe results
  - Substituting commonly 'misheard' words/proper nouns
- ...and Step Functions to orchestrate the whole process!
This project consists of a few components, each with their own CloudFormation Stack:
- 👂 whisper-image, for creating an ECR container image repository where we store the SageMaker container to run the Whisper model
- 🪣 data-resources, for shared data stores, namely an S3 Bucket
- 🧠 sagemaker-resources, for the SageMaker model and IAM role
- 🎙 transcript-orchestration, for orchestration and transcript merging
This project uses AWS SAM with nested stacks to deploy all but the first of these components. The first component is special: it creates the Amazon ECR container image repository, which must exist before we can push our custom Whisper container image. Once pushed, the image can be loaded by the SageMaker resources that the later stacks create.
You will need the following build tooling installed.
- Node.js 18.x and NPM 8.x
- Docker, or other tooling that can build a container image from a Dockerfile and push it to a repository.
- AWS SAM, used to build and deploy most of the application
- The AWS CLI
- esbuild
- SLIC Watch: By default, the target AWS account should have the SLIC Watch SAR Application installed. It can be installed from its Serverless Application Repository page in the AWS Console. SLIC Watch is used to create alarms and dashboards for our transcription application. If you want to skip this option, just remove the single line referring to the SlicWatch-v2 macro from the relevant template, transcript-orchestration/template.yaml.
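Before deploying, it can help to confirm that the command-line tools listed above are actually on your `PATH`. The following is a minimal sketch, not part of the project: `missing_tools` is a hypothetical helper, and the binary names in the example call (`node`, `npm`, `docker`, `sam`, `aws`, `esbuild`) are assumptions about how each tool is installed on your machine.

```shell
# Print each tool that cannot be found on PATH, one per line.
# Returns 0 when every tool is present and 1 when at least one is missing.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (binary names are assumptions):
#   missing_tools node npm docker sam aws esbuild
```

The helper only checks that a command exists; it does not check versions, so the Node.js 18.x and NPM 8.x requirement still needs to be verified separately.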
You can deploy this complete application to your own AWS account.
- Make sure to set the environment variables for the AWS region and profile:

      export AWS_PROFILE=xxx
      export AWS_DEFAULT_REGION=eu-central-1
      export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
- The first deployment step creates the ECR repository. We can use the AWS CLI to do this with CloudFormation:

      aws cloudformation deploy \
        --template ./whisper-image/template.yaml \
        --stack-name whisper-image \
        --tags file://./common-tags.json \
        --capabilities CAPABILITY_NAMED_IAM

  We can now retrieve the repository URI from the CloudFormation outputs:

      REPOSITORY_URI=$(aws cloudformation describe-stacks \
        --stack-name whisper-image \
        --query "Stacks[0].Outputs[?ExportName=='whisper-model-image-repository-uri'].OutputValue" \
        --output text)
- Next, we can build and push the Whisper container image:

      cd whisper-image

      # Build the container image
      docker build --platform linux/amd64 -t $REPOSITORY_URI .

      # Log in to ECR with Docker (requires AWS_DEFAULT_REGION and AWS_ACCOUNT_ID from the first step)
      aws ecr get-login-password | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com

      # Push the container image to ECR
      docker push $REPOSITORY_URI

      # Leave the directory before executing the next step
      cd ..
- Now that our container image is present, we can deploy the rest of the application with AWS SAM.

      sam build --parallel
      sam deploy --guided --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_IAM
      # It should be sufficient to accept all defaults when prompted
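In the image push step, the registry host passed to `docker login` is assembled from `AWS_ACCOUNT_ID` and `AWS_DEFAULT_REGION`. An ECR repository URI has the form `<registry host>/<repository name>`, so the host can also be taken from `REPOSITORY_URI` directly, which avoids a mismatch between the variables. A minimal sketch; `registry_host` is a hypothetical helper name and not part of the project:

```shell
# Print the registry host of an ECR repository URI, i.e. everything
# before the first "/" (the repository name that follows is dropped).
registry_host() {
  printf '%s\n' "${1%%/*}"
}

# Example use with the variable from the earlier step:
#   aws ecr get-login-password \
#     | docker login --username AWS --password-stdin "$(registry_host "$REPOSITORY_URI")"
```

This relies only on shell parameter expansion, so it makes no extra AWS calls.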
That's it! You can now test the entire transcription flow. The entire process is triggered when you upload an audio file to the newly created S3 Bucket:
    aws s3 cp sample-audio/sample1.mp3 s3://pod-transcription-${AWS_ACCOUNT_ID}-${AWS_DEFAULT_REGION}/audio/sample1.mp3
That S3 object upload will create an EventBridge event to trigger the transcription Step Function. You can watch its progress in the Step Functions Console.
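The bucket name in the upload command is built from the account ID and the region, so an unset variable silently produces a wrong bucket name. A minimal sketch of a helper that fails early instead; `audio_upload_uri` is a hypothetical name, and the `pod-transcription-<account id>-<region>` pattern and `audio/` prefix are taken from the command above:

```shell
# Build the S3 destination URI for an audio file.
# Arguments: account ID, region, local file path.
# Fails with a usage message if any argument is empty.
audio_upload_uri() {
  local account_id="$1" region="$2" file="$3"
  if [ -z "$account_id" ] || [ -z "$region" ] || [ -z "$file" ]; then
    echo "usage: audio_upload_uri ACCOUNT_ID REGION FILE" >&2
    return 1
  fi
  printf 's3://pod-transcription-%s-%s/audio/%s\n' \
    "$account_id" "$region" "$(basename "$file")"
}

# Example use:
#   aws s3 cp sample-audio/sample1.mp3 \
#     "$(audio_upload_uri "$AWS_ACCOUNT_ID" "$AWS_DEFAULT_REGION" sample-audio/sample1.mp3)"
```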
To get a better feel for the process, see the following visualization of the Step Function definition: