This is the code used for my GitHub Universe presentation on using GitHub Actions with EMR Serverless.
There is also a workshop you can use with step-by-step instructions: Build analytics applications using Apache Spark with Amazon EMR Serverless.

To run this demo, you will need the following resources:
- An AWS Account with Admin privileges
- GitHub OIDC Provider in AWS
- S3 Bucket(s)
- EMR Serverless Spark application(s)
- IAM Roles for GitHub and EMR Serverless
You can create all of these, including some sample data, using the included CloudFormation template.
Warning 💰 The CloudFormation template creates EMR Serverless applications that you will be charged for when the integration tests and the scheduled workflow run.
Note The IAM roles created in the template are very tightly scoped to the relevant S3 Buckets and EMR Serverless applications created by the stack.
To follow along, just fork this repository into your own account, clone it locally and do the following:
- Create the CloudFormation Stack
```bash
aws cloudformation create-stack \
    --stack-name gh-severless-spark-demo \
    --template-body file://./template.cfn.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters ParameterKey=GitHubRepo,ParameterValue=USERNAME/REPO ParameterKey=CreateOIDCProvider,ParameterValue=true
```
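Stack creation takes a few minutes. If you want to block until it finishes (assuming your default AWS CLI profile and region), the standard wait command works:

```bash
# Wait until the stack reaches CREATE_COMPLETE (fails if creation errors out)
aws cloudformation wait stack-create-complete \
    --stack-name gh-severless-spark-demo
```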
`GitHubRepo` is the `user/repo` format of your GitHub repository that you want your OIDC role to be able to access. `CreateOIDCProvider` allows you to disable creating the OIDC endpoint for GitHub in your AWS account if it already exists.
- Create an "Actions" Secret in your repo
Go to your repository settings, find Secrets on the left-hand side, then Actions. Click "New repository secret" and add a secret named `AWS_ACCOUNT_ID` with your 12-digit AWS Account ID.
Note This is not sensitive info; it just makes it easier to re-use the Actions.
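If you prefer the command line, the GitHub CLI can set the same secret from within your cloned repo (the account ID below is a placeholder):

```bash
# Create the AWS_ACCOUNT_ID repository secret with your 12-digit account ID
gh secret set AWS_ACCOUNT_ID --body "123456789012"
```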
- Update the Application IDs
  - In `integration-test.yaml`, replace `TEST_APPLICATION_ID` with the `TestApplicationId` output from the CloudFormation stack
  - In `run-job.yaml`, replace `PROD_APPLICATION_ID` with the `ProductionApplicationId` output from the CloudFormation stack
The rest of the environment variables in your workflows should stay the same unless you deployed in a region other than `us-east-1`. Both application ID outputs can be retrieved with the `describe-stacks` command shown below.
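A quick way to list those outputs, assuming the stack name used above:

```bash
# Print the stack outputs, including TestApplicationId and ProductionApplicationId
aws cloudformation describe-stacks \
    --stack-name gh-severless-spark-demo \
    --query 'Stacks[0].Outputs'
```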
With that done, you should be able to experiment with pushing new commits to the repo, opening pull requests, and running the "Fetch Data" workflow.
You can view the status of your job runs in the EMR Serverless console.
The demo goes into four specific use cases, each defined as part of a different GitHub Actions workflow. These are intended to be easily reusable.
The `unit-tests.yaml` file defines a very simple GitHub Action that runs on any `push` event. It runs the tests in `pyspark/tests/test_basic.py`.
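Under the hood this amounts to a plain pytest run; a rough local equivalent (assuming pytest and pyspark are the only test dependencies) is:

```bash
# Install test dependencies and run the unit tests locally
pip install pytest pyspark
python -m pytest pyspark/tests/test_basic.py
```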
`integration-test.yaml` runs on any pull request and both 1/ copies the local `pyspark` code to S3 and 2/ runs an EMR Serverless job and waits until it's complete.
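The AWS side of that workflow maps onto standard CLI calls along these lines (a sketch only: the bucket, role ARN, and S3 paths are placeholders, and the real workflow reads the application ID from its `TEST_APPLICATION_ID` environment variable):

```bash
# 1/ Copy the local PySpark code to S3 (placeholder bucket and prefix)
aws s3 cp pyspark/main.py s3://<your-bucket>/github/pyspark/main.py

# 2/ Start a job run on the test application, then poll it with
#    `aws emr-serverless get-job-run` until it reports SUCCESS
aws emr-serverless start-job-run \
    --application-id "$TEST_APPLICATION_ID" \
    --execution-role-arn <your-emr-serverless-job-role-arn> \
    --job-driver '{"sparkSubmit": {"entryPoint": "s3://<your-bucket>/github/pyspark/main.py"}}'
```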
When a semantic-versioned tag is added to the repository, `deploy.yaml` zips up the files in the `jobs` folder and copies the zip and `main.py` files to S3 in a location with the tag as part of the prefix.
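The packaging step is roughly equivalent to the commands below (a sketch; the bucket and prefix are placeholders, and in the workflow the tag comes from the push that triggered it):

```bash
# Example tag; in deploy.yaml this comes from the pushed git tag
TAG=v1.0.0

# Zip the jobs folder and upload both artifacts under a versioned prefix
zip -r jobs.zip jobs/
aws s3 cp jobs.zip "s3://<your-bucket>/github/pyspark/${TAG}/jobs.zip"
aws s3 cp pyspark/main.py "s3://<your-bucket>/github/pyspark/${TAG}/main.py"
```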
`run-job.yaml` runs the `main.py` script on a schedule with the version defined in the `JOB_VERSION` variable. The `workflow_dispatch` section also lets you run the job manually, which by default uses the "latest" semantic tag on the repository.
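Besides the Actions UI, the `workflow_dispatch` trigger means you can also kick the job off with the GitHub CLI (any inputs it accepts are defined in `run-job.yaml` itself):

```bash
# Manually dispatch the scheduled job workflow with its default inputs
gh workflow run run-job.yaml
```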