MiniWDL AWS Batch and GPU support

Infrastructure deployment

EFS infrastructure

Deployment CLI. In the commands below, replace:

somebody@someemail.com with your username/email, used to identify the allocated resources
miniwdl-bucket with the desired name of the S3 bucket for outputs (miniwdl-bucket is the default)

# Create a new VPC and deploy MiniWDL infrastructure in this VPC
aws cloudformation deploy --template-file cfn-miniwdl-new-vpc.yaml \
    --stack-name MiniWDL-new-VPC --capabilities CAPABILITY_NAMED_IAM \
    --parameter-overrides S3UploadBucket=miniwdl-bucket Owner=somebody@someemail.com
aws cloudformation describe-stacks --stack-name MiniWDL-new-VPC

# Deploy MiniWDL infrastructure in existing VPC
aws cloudformation deploy --template-file cfn-miniwdl.yaml \
    --stack-name MiniWDL --capabilities CAPABILITY_NAMED_IAM \
    --parameter-overrides  \
    S3UploadBucket=miniwdl-bucket \
    Owner=somebody@someemail.com  \
    Subnet0=subnet-0b2ad3bbbe3652a00 \
    Subnet1=subnet-0f6db482bddf223c8 \
    SecurityGroupId=sg-058deaa09fcdadc69

aws cloudformation describe-stacks --stack-name MiniWDL
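The full describe-stacks response is verbose. To see only the values a stack exports (if the template defines any), a --query filter can narrow it down. A minimal sketch: the helper name stack_outputs and the AWS_CLI override (useful for a dry run) are illustrative and not part of this repository.

```shell
# Hypothetical helper: print a CloudFormation stack's outputs as a table.
# AWS_CLI is an illustrative override (e.g. AWS_CLI=echo for a dry run).
stack_outputs() {
    "${AWS_CLI:-aws}" cloudformation describe-stacks \
        --stack-name "$1" \
        --query 'Stacks[0].Outputs' \
        --output table
}
```

For example, stack_outputs MiniWDL inspects the stack deployed into an existing VPC.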

FSx for Lustre infrastructure

NOT YET FINISHED AND NOT YET WORKING

# Deploy MiniWDL FSx for Lustre infrastructure in existing VPC
aws cloudformation deploy --template-file cfn-miniwdl-fsx.yaml \
    --stack-name MiniWDL-fsx --capabilities CAPABILITY_NAMED_IAM \
    --parameter-overrides  \
    S3UploadBucket=miniwdl-bucket \
    Owner=somebody@someemail.com  \
    SubnetId=subnet-0b2ad3bbbe3652a00 \
    SecurityGroupId=sg-058deaa09fcdadc69

miniwdl-aws-submit --no-efs \
  --workflow-queue miniwdl-lustre-workflow \
  --self-test --follow  

aws cloudformation describe-stacks --stack-name MiniWDL-fsx

Install latest version

pip install git+https://github.com/staskh/miniwdl-aws.git

Test deployment

Replace the --s3upload value with the bucket selected during infrastructure deployment.

To test your setup, run:

miniwdl-aws-submit --self-test --follow --workflow-queue miniwdl-workflow

The same test can be performed explicitly with:

miniwdl-aws-submit --verbose --no-cache --follow --s3upload s3://miniwdl-bucket/self_test https://raw.githubusercontent.com/staskh/miniwdl-aws/main/test_workflow/self_test/test.wdl who=https://raw.githubusercontent.com/chanzuckerberg/miniwdl/main/tests/alyssa_ben.txt

To test a GPU-based workflow, run:

miniwdl-aws-submit --verbose --no-cache --follow --s3upload s3://miniwdl-bucket/gpu_test  https://raw.githubusercontent.com/staskh/miniwdl-aws/main/test_workflow/gpu_test/gpu_test.wdl

Fork improvements

CloudFormation template for cloud setup

Deployment script for the miniwdl-aws cloud infrastructure, replacing the Terraform-based miniwdl-aws-terraform script

Support for GPU-based tasks

See WDL example at https://github.com/staskh/miniwdl-aws/tree/main/test_workflow/gpu_test

miniwdl AWS plugin

Extends miniwdl to run workflows on AWS Batch and EFS

This miniwdl plugin enables it to execute WDL tasks as AWS Batch jobs. It uses EFS for work-in-progress file I/O, optionally uploading final workflow outputs to S3.

Before diving into this, first consider Amazon Omics, which includes a WDL workflow runner service that doesn't need you to deploy compute infrastructure in your AWS account. (The behind-the-scenes implementation differs from the plugin found here.)

There are a few ways to deploy this miniwdl-aws plugin:

Amazon Genomics CLI

Amazon Genomics CLI can deploy a miniwdl-aws context into your AWS account with all the necessary infrastructure.

Amazon SageMaker Studio

Or, try the miniwdl-aws-studio recipe to install miniwdl for interactive use within Amazon SageMaker Studio, a web IDE with a terminal and filesystem browser. You can use the terminal to operate miniwdl run against AWS Batch, the filesystem browser to manage the inputs and outputs on EFS, and the Jupyter notebooks to further analyze the outputs.

miniwdl-aws-submit with custom infrastructure

Lastly, advanced operators can use miniwdl-aws-terraform to deploy/customize the necessary AWS infrastructure, including a VPC, EFS file system, Batch queues, and IAM roles.

In this scheme, a local command-line wrapper miniwdl-aws-submit launches miniwdl in its own small Batch job to orchestrate a workflow. This workflow job then spawns WDL task jobs as needed, without needing the submitting laptop to remain connected for the duration. The workflow jobs run on lightweight Fargate resources, while task jobs run on EC2 spot instances.

Submitting workflow jobs

After deploying miniwdl-aws-terraform, pip3 install miniwdl-aws locally to make the miniwdl-aws-submit program available. Try the self-test:

miniwdl-aws-submit --self-test --follow --workflow-queue miniwdl-workflow

Then launch a viral genome assembly that should run in 10-15 minutes:

miniwdl-aws-submit \
  https://github.com/broadinstitute/viral-pipelines/raw/v2.1.28.0/pipes/WDL/workflows/assemble_refbased.wdl \
  reads_unmapped_bams=https://github.com/broadinstitute/viral-pipelines/raw/v2.1.19.0/test/input/G5012.3.testreads.bam \
  reference_fasta=https://github.com/broadinstitute/viral-pipelines/raw/v2.1.19.0/test/input/ebov-makona.fasta \
  sample_name=G5012.3 \
  --workflow-queue miniwdl-workflow \
  --s3upload s3://MY-BUCKET/assemblies \
  --verbose --follow

The command line resembles miniwdl run's with extra AWS-related arguments:

--workflow-queue Batch job queue on which to schedule the workflow job; output from miniwdl-aws-terraform, default miniwdl-workflow. (Also set by environment variable MINIWDL__AWS__WORKFLOW_QUEUE)
--follow live-streams the workflow log instead of exiting immediately upon submission. (--wait blocks on the workflow without streaming the log.)
--s3upload (optional) S3 folder URI under which to upload the workflow products, including the log and output files (if successful). The bucket must be allow-listed in the miniwdl-aws-terraform deployment.
- Unless --s3upload ends with /, one more subfolder is added to the uploaded URI prefix, equal to miniwdl's automatic timestamp-prefixed run name. If it does end in /, then the uploads go directly into/under that folder (and a repeat invocation would be expected to overwrite them).
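The trailing-slash rule for --s3upload can be sketched as a small shell function. This only illustrates the rule as described above; it is not the plugin's own code, and the function name and the sample run name are hypothetical.

```shell
# Hypothetical helper: effective_s3_prefix <s3upload-uri> <run-name>
# Prints the prefix under which uploads would land, per the rule above.
effective_s3_prefix() {
    case "$1" in
        */) printf '%s\n' "$1" ;;          # ends with /: upload directly under it
        *)  printf '%s/%s\n' "$1" "$2" ;;  # otherwise: add a run-name subfolder
    esac
}

effective_s3_prefix s3://MY-BUCKET/assemblies  20240101_120000_assemble_refbased
# prints s3://MY-BUCKET/assemblies/20240101_120000_assemble_refbased
effective_s3_prefix s3://MY-BUCKET/assemblies/ 20240101_120000_assemble_refbased
# prints s3://MY-BUCKET/assemblies/
```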

miniwdl-aws-submit detects other infrastructure details (task queue, EFS access point, IAM role) based on the workflow queue; see miniwdl-aws-submit --help for additional options to override those defaults.

If the specified WDL source code is an existing local .wdl or .zip file, miniwdl-aws-submit automatically ships it with the workflow job as the WDL to execute. Given a .wdl file, it runs miniwdl zip to detect & include any imported WDL files, whereas .zip files are assumed to have been generated by miniwdl zip. If the source code is too large to fit in the AWS Batch request payload (~50KB), then you'll instead have to pass it by reference to a URL or EFS path.
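A quick pre-flight size check can show whether a local file is likely to fit before submitting. A rough sketch: the function name is hypothetical, and the 50000-byte default is an assumption standing in for the approximate ~50KB figure (the effective limit may differ once the source is zipped and encoded into the request).

```shell
# Hypothetical pre-flight check: succeed if FILE is at most LIMIT bytes.
# Usage: fits_in_batch_payload FILE [LIMIT]   (LIMIT defaults to an assumed 50000)
fits_in_batch_payload() {
    size=$(( $(wc -c < "$1") ))    # arithmetic expansion strips any padding from wc
    [ "$size" -le "${2:-50000}" ]
}
```

For instance, fits_in_batch_payload workflow.zip || echo "pass the WDL by URL or EFS path instead", where workflow.zip is a placeholder for the archive produced by miniwdl zip.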

Arguments not consumed by miniwdl-aws-submit are passed through to miniwdl run inside the workflow job; as are environment variables whose names begin with MINIWDL__, allowing override of any miniwdl configuration option (disable with --no-env). See miniwdl_aws.cfg for various options preconfigured in the workflow job container.
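Before submitting, it can help to see which variables in the current shell carry the MINIWDL__ prefix and would therefore be candidates for forwarding. A minimal sketch: the helper name is hypothetical, and it only inspects the local environment rather than reproducing the plugin's own logic.

```shell
# Hypothetical helper: list local environment variables with the MINIWDL__ prefix.
list_miniwdl_env() {
    env | grep '^MINIWDL__' | sort
}

# Select the workflow queue through the environment instead of --workflow-queue.
export MINIWDL__AWS__WORKFLOW_QUEUE=miniwdl-workflow
list_miniwdl_env
```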

The workflow and task jobs all mount EFS at /mnt/efs. Although workflow input files are usually specified using HTTPS or S3 URIs, files already resident on EFS can be used with their /mnt/efs paths (which probably don't exist locally on the submitting machine). Unlike the WDL source code, miniwdl-aws-submit will not attempt to ship/upload local input files.

Run directories on EFS

Miniwdl runs the workflow in a directory beneath /mnt/efs/miniwdl_run (override with --dir). The outputs also remain cached there for potential reuse in future runs (to avoid, submit with --no-cache or wipe /mnt/efs/miniwdl_run/_CACHE).

Given the EFS-centric I/O model, you'll need a way to browse and manage the filesystem contents remotely. The companion recipe lambdash-efs is one option; miniwdl-aws-terraform outputs the infrastructure details needed to deploy it (pick any subnet). Or, set up an instance/container mounting your EFS, to access via SSH or web app (e.g. JupyterHub, Cloud Commander, VS Code server).

You can also automate cleanup of EFS run directories by setting miniwdl-aws-submit --s3upload and:

--delete-after success to delete the run directory immediately after successful output upload
--delete-after failure to delete the directory after failure
--delete-after always to delete it in either case
(or set environment variable MINIWDL__AWS__DELETE_AFTER_S3_UPLOAD)

Deleting a run directory after success prevents the outputs from being reused in future runs. Deleting it after failures can make debugging more difficult (although logs are retained, see below).
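The three --delete-after modes amount to a small decision table, sketched below. This illustrates the documented behavior only; the function name and the success/failure outcome labels are hypothetical and not taken from the plugin.

```shell
# Hypothetical helper: should_delete_run_dir <mode> <outcome>
# <mode> is a --delete-after value; <outcome> is "success" or "failure".
# Succeeds (exit 0) when the run directory would be deleted.
should_delete_run_dir() {
    case "$1" in
        always)          return 0 ;;
        success|failure) [ "$1" = "$2" ] ;;
        *)               return 1 ;;   # no (or unrecognized) mode: keep the directory
    esac
}
```

For example, should_delete_run_dir success failure returns non-zero, so a failed run keeps its directory available for debugging.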

Security note on file system isolation

Going through AWS Batch & EFS, miniwdl can't enforce the strict file system isolation between WDL task containers that it does locally. All the AWS Batch containers have read/write access to the entire EFS file system (as viewed through the access point), not only their initial working directory.

This is usually benign, because WDL tasks should only read their declared inputs and write into their respective working/temporary directories. But poorly- or maliciously-written tasks could read & write files elsewhere on EFS, even changing their own input files or those of other tasks. This risks unintentional side-effects or worse security hazards from untrusted code.

To mitigate this, test workflows thoroughly using the local backend, which strictly isolates task containers' file systems. If WDL tasks insist on modifying their input files in place, then --copy-input-files can unblock them (at a cost in time, space, and IOPS). Lastly, avoid using untrusted WDL code or container images; but if they're necessary, then use a separate EFS access point and restrict the IAM and network configuration for the AWS Batch containers appropriately.

EFS performance considerations

To scale up to larger workloads, it's important to study AWS documentation on EFS performance and monitoring. Like any network file system, EFS limits on throughput and IOPS can cause bottlenecks; and worse, exhaustion of the default bursting throughput mode can effectively freeze a workflow.

Management tips:

Monitor file system throughput limits, IOPS, and burst credits in the EFS area of the AWS Console.
Stage large datasets onto the file system well in advance, increasing the available burst throughput.
Enable the Elastic or Provisioned throughput modes (at increased cost).
Code WDL tasks to write any purely-temporary files into $TMPDIR, which may use local scratch space, instead of the EFS working directory.
Configure miniwdl and AWS Batch to limit the number of concurrent jobs and/or the rate at which they turn over (see miniwdl_aws.cfg for relevant details).
Spread out separate workflow runs over time or across multiple EFS file systems.
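The $TMPDIR tip might look like the following inside a WDL task's command section. A minimal sketch: the file names are placeholders, and the /tmp fallback is an assumption for shells where TMPDIR is unset.

```shell
# Create a private scratch directory under $TMPDIR (falling back to /tmp)
# and remove it when the shell exits.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/scratch.XXXXXX")
trap 'rm -rf "$scratch"' EXIT

# Intermediate data stays in scratch; only the final result is written
# to the current working directory (which is on EFS for a task job).
printf 'b\na\nc\n' > "$scratch/unsorted.txt"
sort "$scratch/unsorted.txt" > sorted.txt
```

Only sorted.txt touches the shared file system; the intermediate file never consumes EFS throughput or IOPS.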

FSx for Lustre and other shared filesystems

If EFS performance remains insufficient, then you can configure your Batch compute environments to automatically mount some other shared filesystem upon instance startup. Then use miniwdl-aws-submit --no-efs to make it assume the filesystem will already be mounted at a certain location (default --mount /mnt/net) across all instances. In this case, the compute environment for workflow jobs is expected to use EC2 instead of Fargate resources (usually necessary for mounting).

The miniwdl-aws-terraform repo includes a variant setting this up with FSx for Lustre. FSx offers higher throughput scalability, but has other downsides compared to EFS (higher upfront costs, manual capacity scaling, single-AZ deployment, fewer AWS service integrations).

Logs & troubleshooting

If the terminal log isn't available (through Studio or miniwdl-aws-submit --follow) to trace a workflow failure, look for miniwdl's usual log files written in the run directory on EFS or copied to S3.

Each task job's log is also forwarded to CloudWatch Logs under the /aws/batch/job group and a log stream name reported in miniwdl's log. Using miniwdl-aws-submit, the workflow job's log is also forwarded. CloudWatch Logs indexes the logs for structured search through the AWS Console & API.
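Given a log stream name copied from miniwdl's log, the AWS CLI can print that stream's messages without opening the console. A minimal sketch: the helper name and the AWS_CLI override (for a dry run) are illustrative, and long streams may need pagination beyond what is shown here.

```shell
# Hypothetical helper: print the messages of one job's CloudWatch log stream.
# AWS_CLI is an illustrative override (e.g. AWS_CLI=echo for a dry run).
task_log() {
    "${AWS_CLI:-aws}" logs get-log-events \
        --log-group-name /aws/batch/job \
        --log-stream-name "$1" \
        --start-from-head \
        --query 'events[].message' \
        --output text
}
```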

Misconfigured infrastructure might prevent logs from being written to EFS or CloudWatch at all. In that case, use the AWS Batch console/API to find status messages for the workflow or task jobs.
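From the command line, the job status and its reason can be queried through the Batch API. A minimal sketch, assuming you already have the job ID (for example from the submission output or the Batch console); the helper name and the AWS_CLI override are illustrative.

```shell
# Hypothetical helper: show name, status, and status reason for a Batch job ID.
# AWS_CLI is an illustrative override (e.g. AWS_CLI=echo for a dry run).
job_status() {
    "${AWS_CLI:-aws}" batch describe-jobs \
        --jobs "$1" \
        --query 'jobs[].[jobName,status,statusReason]' \
        --output text
}
```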

Contributing

Pull requests are welcome! For help, open an issue here or drop in on #miniwdl in the OpenWDL Slack.

Code formatting and linting. To prepare your code to pass the CI checks,

pip3 install --upgrade -r test/requirements.txt
pre-commit run --all-files

Running tests. In an AWS-credentialed terminal session,

MINIWDL__AWS__WORKFLOW_QUEUE=miniwdl-workflow test/run_tests.sh

This builds the requisite Docker image from the current code revision and pushes it to an ECR repository (which must be prepared once with aws ecr create-repository --repository-name miniwdl-aws). To test an image from the GitHub public registry or some other version, set MINIWDL__AWS__WORKFLOW_IMAGE to the desired tag.
