AWS ML batch pipeline starter

Overview

Starter for end-to-end AWS serverless technology-based pipeline for a batch model, supporting ECS or Kubernetes for model execution, and Spark running on EMR for ETL.

With the infrastructure automatically provisioned using IaC and CI/CD, you provide an implementation of the model, ETL, and validation using starter code, and changes will automatically deploy to your cluster.

Features

Status	Description
❌	AWS Step Function-based orchestration
❌	Lambda stubs for input and output validation
❌	CloudWatch-based scheduled events for execution
❌	Dockerized model execution in ECS
❌	Dockerized model execution in EKS
❌	ETL using Spark on EMR
✅	Input and output stored in S3
✅	Terraform for infrastructure management
❌	Jenkins CI/CD pipeline

Setup

Install local developer dependencies

These are necessary for local development, and will also be necessary for doing some setup prior to configuring the CI/CD pipeline.

Note that our CI/CD pipeline will automatically install these dependencies in the worker nodes.

(Recommended) Install and configure AWS CLI. (Tested with version 1.16.304.)
Install and configure Terraform. (Tested with version 0.12.19.)

Setup Terraform shared state management

Prior to setting up the automated deployment pipeline, you need to setup Terraform so that it can used S3 for storage of shared state. This will enable everyone working on your project to contribute infrastructure changes.

Though the bucket will be managed by Terraform, it cannot be created by Terraform because Terraform is dependent on the shared state that will be stored in the bucket; it's a catch-22. The workaround is to first create the S3 bucket, then initialize Terraform (which creates the shared state at terraform/storage/terraform.tfstate), and afterwards manually import the bucket into Terraform.

Manually create an S3 bucket, which will be used for storing Terraform state, as well as input and output data.
- E.g., using AWS CLI:
```
aws s3 mb s3://my-bucket --region us-east-1
```

Import the bucket to be under Terraform control

Open terminal, and change directory: cd storage/terraform

Initialize Terraform locally:

terraform init \
    -backend-config="bucket=my-bucket" \
    -backend-config="region=us-east-1"

Import the bucket you just created so that it is managed by Terraform:

terraform import \
    -var="bucket=my-bucket" \
    -var="region=us-east-1" \
    aws_s3_bucket.bucket my-bucket

(Recommended) Confirm Terraform is correctly setup by running plan:

terraform plan \
    -var="region=us-east-1" \
    -var="bucket=my-bucket"

If successful, you'll see a view that shows which changes will be made. To apply them now:

terraform apply -auto-approve \
    -var="region=us-east-1" \
    -var="bucket=my-bucket"

(Recommended) Confirm Terraform shared state was successfully created within the bucket.
- E.g., using AWS CLI:
```
aws s3 ls s3://my-bucket/terraform/storage/terraform.tfstate
```

Setup CI/CD using Jenkins

TODO

Environment Variables