meltano-batch

A simple setup of Meltano Extract and Load on AWS Batch, managing the infrastructure with Terraform.

A service to set up a repeatable Meltano EL process, with smoke tests installed. It runs the Meltano ELT process only and does not provide a Meltano frontend (which, as of writing, is not essential).

If you are looking for a simpler approach, I strongly recommend taking a look at Meltano-on-Github-Actions, which requires far less DevOps hassle.

The only reasons not to use GitHub Actions are if you require much longer-running loads, control over the infrastructure specifications, or for the data movement to be fully contained within an AWS environment.

Prerequisites

  1. Select an AWS Region. Be sure that all required services (e.g. AWS Batch, AWS Lambda) are available in the Region selected.
  2. Install Docker.
  3. Install HashiCorp Terraform.
  4. Install the latest version of the AWS CLI and confirm it is properly configured.

Setup

  1. Set up Terraform:
git clone git@github.com:mattarderne/meltano-batch.git
cd meltano-batch/terraform
terraform init
  2. Run Terraform, which will create all the necessary infrastructure:
terraform plan
terraform apply

Build and Push Docker Image

Once finished, Terraform will output the name of your newly created ECR repository, e.g. 123456789.dkr.ecr.eu-west-1.amazonaws.com/meltano-batch-ecr-repo:latest. Note this value, as we will use it in subsequent steps (referred to as MY_REPO_NAME):

cd ..
cd meltano

# build the docker image
docker build -t aws-batch-meltano .

# (optional) test the docker image
docker run \
    --volume $(pwd)/output:/project/output \
    aws-batch-meltano \
    elt tap-smoke-test target-jsonl

# tag the image
docker tag aws-batch-meltano:latest <MY_REPO_NAME>:latest

# login to the ECR, replace <region>
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <MY_REPO_NAME>

# push the image to the ECR repository
docker push <MY_REPO_NAME>:latest

The above steps are automated in the meltano/deploy_aws_ecr.sh script.

Create a Job

Now that the Docker image has been pushed to ECR, you can invoke a job with the command below, which will print the logs. Replace <region>:

aws lambda invoke --function-name submit-job-smoke-test  --region <region> \
outfile --log-type Tail \
--query 'LogResult' --output text |  base64 -d

You should be able to view a list of jobs with the command below. Note that list-jobs only returns RUNNING jobs by default, so the list will appear empty once the job has finished; pass --job-status SUCCEEDED (or FAILED) to see completed jobs.

aws batch list-jobs --job-queue meltano-batch-queue --job-status SUCCEEDED

Meltano UI

Load the Meltano UI to have a look. Currently it is only for display purposes, but it can be configured to display the Meltano app and kick off ad-hoc jobs. Using App Runner (example in terraform/archive/apprunner.tf) is viable for deploying to production, but requires a backend DB to be configured in the Dockerfile.

docker run \
    --volume $(pwd)/output:/project/output \
    aws-batch-meltano \
    ui
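
For the App Runner route mentioned above, a minimal sketch of what such a service might look like in Terraform (the resource names, IAM role, and port are assumptions; terraform/archive/apprunner.tf is the actual reference in this repo):

resource "aws_apprunner_service" "meltano_ui" {
  service_name = "meltano-ui"

  source_configuration {
    # hypothetical IAM role that lets App Runner pull from the ECR repo
    authentication_configuration {
      access_role_arn = aws_iam_role.apprunner_ecr_access.arn
    }
    image_repository {
      image_identifier      = "<MY_REPO_NAME>:latest"
      image_repository_type = "ECR"
      image_configuration {
        port = "5000" # Meltano UI default port
      }
    }
  }
}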

Resourcing

Depending on the size of the data transferred, you may need to increase the resources in the AWS Batch aws_batch_job_definition by editing the following fields, e.g. from 2 to 8 vCPUs and from 2 GB to 8 GB of RAM:

  "vcpus": 2 -> 8,
  "memory": 2000 -> 8000,

Notifications

By default, no notifications are set up. Ideally this would be handled with AWS SNS.

Slack notifications can be turned on as follows:

  1. Change the handler line in elt_tap_smoke_test-target_jsonl.tf from handler = "lambda.lambda_handler" to handler = "alerts.lambda_handler"
  2. Change the source line in main.py from source_file = "lambda/lambda.py" to source_file = "alerts/lambda.py"
  3. Create a Slack webhook, then create a secret.tfvars file in the lambda directory containing the webhook URL:
slack_webhook = "<slack_webhook>"
  4. Change var.slack_webhook_toggle in the variables.tf file to true (lowercase); see the sketch after this list
  5. Install requests into the terraform/lambda directory:
cd terraform # must be run from the terraform directory
pip install --target ./lambda requests
  6. Run terraform apply -var-file="secret.tfvars"
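
The toggle and webhook referenced above are plain Terraform variables. A rough sketch of what the relevant entries in variables.tf might look like (an assumption; check the actual file in the repo):

variable "slack_webhook" {
  type        = string
  description = "Slack incoming-webhook URL, supplied via secret.tfvars"
  default     = ""
}

variable "slack_webhook_toggle" {
  type        = bool
  description = "Set to true to enable Slack notifications"
  default     = true
}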

Test with the aws lambda ... command above; it should ping Slack. However, it only pings when the job starts (or fails to start), not on the outcome of the job. A proper setup would use AWS Batch SNS Notifications.
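
As a rough sketch of that SNS route (resource names are assumptions, and an SNS topic policy and subscription would still be needed), an EventBridge rule matching failed Batch jobs can publish to an SNS topic:

resource "aws_sns_topic" "batch_job_failures" {
  name = "meltano-batch-job-failures" # hypothetical name
}

resource "aws_cloudwatch_event_rule" "batch_job_failed" {
  name        = "meltano-batch-job-failed"
  description = "Fires when an AWS Batch job enters the FAILED state"

  event_pattern = jsonencode({
    source        = ["aws.batch"]
    "detail-type" = ["Batch Job State Change"]
    detail = {
      status = ["FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule = aws_cloudwatch_event_rule.batch_job_failed.name
  arn  = aws_sns_topic.batch_job_failures.arn
}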

Todo

AWS