This is a simple serverless application that helps automate preparing containers for use with AWS HealthOmics Workflows that performs two key functions:
container-puller
: Retrieves container images from public registries like (Amazon ECR Public, Quay.io, DockerHub) and stages them in Amazon ECR Private image repositories in your AWS accountcontainer-builder
: Builds ECR Private container images from source bundles staged in S3
Under the hood, it this application leverages AWS Step Functions, AWS CodeBuild, and Amazon ECR for much of the heavy lifting.
- The following software is required in your environment
Deploy the AWS CloudFormation stacks used by the application in each region you intend to run AWS HealthOmics Workflows using the following:
NOTE:
Ensure you are in the same directory where this file is located
# install package dependencies
npm install
# in your default region (specify profile if other than 'default')
cdk deploy --all --profile <aws-profile>
From here, you can proceed to either:
This is the most common case, and likely the only functionality of this CDK application you will use.
Create a container_pull_manifest.json
file with contents like:
{
"manifest": [
"ubuntu:20.04",
"us.gcr.io/broad-gatk/gatk:4.4.0.0",
"ghcr.io/miniwdl-ext/miniwdl-aws:v0.9.0-1-g243f36f",
"quay.io/biocontainers/bcftools:1.16--hfe4b78e_1",
"public.ecr.aws/docker/library/python:3.9.16-bullseye",
"quay/biocontainers/bwa-mem2:2.2.1--he513fc3_0",
"ecr-public/aws-genomics/google/deepvariant:1.4.0"
]
}
Execute the following to pull this list of container images into your ECR private registry:
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:<aws-region>:<aws-account-id>:stateMachine:omx-container-puller \
--input file://container_pull_manifest.json
The state-machine will pull source image uris into your private ECR registry according to the following rules:
-
Images are privatized into repositories with namespaces that correspond to their original source public registry - e.g. a full repository name is of the format
<namespace>/<image-name>
and a full private image uri will be of the form<aws-account-id>.dkr.ecr.<aws-region>.amazonaws.com/<namespace>/<image-name>:<image-tag>
-
Images from Quay.io or ECR Public (public.ecr.aws) use ECR pull through caching
- An image from Quay.io will be pulled into a repository with the
quay/
namespace - An images from public.ecr.aws will be pulled into a repository with the 'ecr-public/` namespace
- An image from Quay.io will be pulled into a repository with the
-
Images from other known public registries are pulled and pushed into a custom created ECR private repository using a CodeBuild project
- An image from Google Container Registry (gcr.io) will be pulled into a repository with the
gcr/
namespace - An image from Google Artifact Registry (pkg.dev) will be pulled into a repository with the
gar/
namespace - An image from Github Container Registry (ghcr.io) will be pulled into a repository with the
ghcr/
namespace - An image from Microsoft Container Registry (mcr.microsoft.com) will be pulled into a repository with the
mcr/
namespace - An image from DockerHub will be pulled into a repository with the
dockerhub/
namespace
- An image from Google Container Registry (gcr.io) will be pulled into a repository with the
-
Images from other (less common) public registries are not supported at this time
When the omx-container-puller
state machine completes, and all containers have been pulled into ECR Private successfully, you can can proceed to configuring and running your workflow.
This process is only necessary if you need to retrieve container images from ECR Private registries, either in your AWS account or from other AWS accounts.
This process works alongside retrieving container images from public registries.
Create a config file at the root level of this application call app-config.json
with contents like:
{
"container_puller": {
"source_aws_accounts": [
"111122223333",
"444455556666"
]
}
}
The source_aws_accounts
list specifies other AWS account ids that the application is allowed to pull ECR Private images from. By default, each account id will resolve to a corresponding ECR Private registry in the region that the CDK application is deployed to.
For the example above, if the CDK application was deployed to the us-east-1
AWS region, this would be:
111122223333.dkr.ecr.us-east-1.amazonaws.com
444455556666.dkr.ecr.us-east-1.amazonaws.com
To allow pulling containers across AWS regions, modify the config to be like:
{
"container_puller": {
"source_aws_accounts": [
"111122223333",
"444455556666"
],
"allow_cross_region_pull": true
}
}
Note: Cross-region container image pulls are typically slower and may incur additional cost.
(Re)deploy the application using:
cdk deploy --all
Once deployed you can start an AWS StepFunctions state machine execution as in Retrieving public containers, only now your container_pull_manifest.json
can contain image URIs like:
{
"manifest": [
"111122223333.dkr.ecr.us-east-1.amazonaws.com/foo:1.1.1",
"444455556666.dkr.ecr.us-east-1.amazonaws.com/bar:2.2.2"
]
}
The state-machine will pull source image uris into your private ECR registry according to the same rules as described in Retrieving public containers. This configuration adds the following:
- Images from other ECR Private registries are replicated in your ECR Private registry as
<your-aws-account-id>.dkr.ecr.<aws-region>.amazonaws.com/<source-ecr-image-repository-name>:<sourc-image-tag>
Note that this process is only necessary if you wish to build containers from scratch rather than simply retreiving them from a public repository as described above.
Create a config file at the root level of this application called app-config.json
with contents like:
{
"container_builder": {
"source_uris": [
"s3://my-bucket-1",
"s3://my-bucket-2/path/to/source"
]
}
}
The source_uris
list specifies S3 locations where source for container images (e.g. Dockerfiles
and accompanying assets) have been staged. You must create these locations beforehand.
(Re)deploy the application using:
cdk deploy --all
Sync container image source to the location(s) you specified above. These can either be bare source:
aws s3 sync ./path/to/container-source/image-foo s3://my-bucket-1/container-source/image-foo
or zip bundles:
(cd ./path/to/container-source/image-bar && zip -r ../image-bar.zip .)
aws s3 cp ./path/to/container-source/image-bar.zip s3://my-bucket/container-source/image-bar.zip
Create a container_build_manifest.json
with contents like the following:
{
"manifest": [
{
"source_uri": "s3://my-bucket/container-source/image-foo/",
"target_image": "foo:omics"
},
{
"source_uri": "s3://my-bucket/container-source/image-bar.zip",
"target_image": "bar:omics"
}
]
}
Execute the following to build this list of container images and place them in your ECR private registry:
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:<aws-region>:<aws-account-id>:stateMachine:omx-container-builder \
--input file://container_build_manifest.json
The state-machine will retrieve the source bundles and build images using the available Dockerfile
in each. Images are built and pushed to ECR private repositories that match the target_image
values in the input manifest. Full private image uris will be of the form <aws-account-id>.dkr.ecr.<aws-region>.amazonaws.com/<target_image>
. If you need images to have specific namespaces you can include them in the target_image
- e.g. mynamespace/foo:omics
.
After processing the manifests above, configure your workflow to use images from your ECR Private Registry:
// how this is defined depends on the workflow lanugage used, but is effectively something like:
// configure list of containers
ecr_registry = "aws-account-id>.dkr.ecr.<aws-region>.amazonaws.com"
containers = {
"task1": ecr_registry + "/dockerhub/ubuntu:20.04",
"task2": ecr_registry + "/gcr/broad-gatk/gatk:4.4.0.0",
"task3": ecr_registry + "/ghcr/miniwdl-ext/miniwdl-aws:v0.9.0-1-g243f36f",
"task4": ecr_registry + "/quay/biocontainers/bcftools:1.16--hfe4b78e_1",
"task5": ecr_registry + "/ecr-public/docker/library/python:3.9.16-bullseye",
"task6": ecr_registry + "/quay/biocontainers/bwa-mem2:2.2.1--he513fc3_0",
"task7": ecr_registry + "/ecr-public/aws-genomics/google/deepvariant:1.4.0",
"task8": ecr_registry + "/foo:omics",
"task9": ecr_registry + "/bar:omics"
}
// use containers[task_name] in task definitions
task task1 {
runtime {
container: containers['task1']
}
...
}
Then create and run your workflow using AWS HealthOmics.
The Amazon ECR Helper for HealthOmics is a serverless application and does not incur costs when idle.
Any Amazon ECR container image repositories created have a storage cost - see Amazon ECR pricing for more details.
The container-puller
stack automates "privatizing" container images - re-staging them from public registries to an ECR Private registry.
To do this, it first relies on ECR Pull-through caching which:
- allows docker clients to pull an image URI that looks like it comes from ECR Private
- creates a private image repository based on the public image that is being pulled
- pulls the public image and caches it in the private repository
Second, an EventBridge rule is used to detect when an image repository is created.
Third, the EventBridge rule triggers a Lambda function that applies the required access policy to the ECR repository that was created. This policy looks like:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "omics workflow access",
"Effect": "Allow",
"Principal": {"Service": "omics.amazonaws.com"},
"Action": [
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:BatchCheckLayerAvailability"
]
}
]
}
Currently, AWS HealthOmics checks for the existence of ECR image repositories and specific image URIs before launching ECS tasks. This pre-check means you need to "prime" container images into ECR Private prior to a running a workflow that depends on them the first time, even if pull-through caching is enabled.
The priming process is automated by submitting a container image manifest to a StepFunctions state machine that calls a CodeBuild Project that retrieves container image URIs.
The mechanisms above are also generalized to support other public container registries in the following ways:
- The ECR CreateRepository API is called when a corresponding repository does not exist. This is only used when retriving images from public registries that do not support pull-through caching.
AwsApiCall
events that create ECR repositories with the tagKey=createdBy,Value=omx-ecr-helper
will also trigger the Lambda Function.- The CodeBuild project is parameterized to do either pull-through only or pull and push actions
To save costs the workflow will only run the CodeBuild project if a requested image uri does not already have a corresponding private ECR image.
The container-builder
stack automates building container images using AWS Step Functions and AWS CodeBuild. An ECR private repository is created as needed for the image. It also utilizes the same EventBridge rule, trigger, and Lambda above to add required ECR repository policies.
Note: There are no capabilities at this time to check if the image source has changed such that it would result in an updated image relative to an existing one in ECR Private. Therefore, CodeBuild project builds in this stack will always execute when processing a manifest. As a result, if images already exist in ECR Private they will be overwritten.
The cdk.json
file tells the CDK Toolkit how to execute your app.
npm run build
compile typescript to jsnpm run watch
watch for changes and compilenpm run test
perform the jest unit testscdk deploy
deploy this stack to your default AWS account/regioncdk diff
compare deployed stack with current statecdk synth
emits the synthesized CloudFormation template
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.