terraform-aws-snowplow-databricks-pipeline

Overview

A Terraform module which deploys a pipeline to load Snowplow data into Databricks using the Snowplow Open Source artefacts.

This module deploys the Collector application, the Enrich application, the Kinesis streams and S3 loaders that connect them, the Transformer, and the Databricks Loader.

For more details on the Snowplow pipeline, see the official Snowplow documentation:

https://docs.snowplow.io/docs/understanding-your-pipeline/architecture-overview-aws/

Databricks Loader-specific details and prerequisites are documented here:

https://docs.snowplow.io/docs/destinations/warehouses-and-lakes/rdb/loading-transformed-data/databricks-loader/#setting-up-databricks

Usage

Reference the module in your configuration and provide the required input variables.

module "snowplow-databricks-pipeline" {
  source = "Datomni/snowplow-databricks-pipeline/aws"

  vpc_id             = var.vpc_id
  private_subnet_ids = var.private_subnet_ids
  public_subnet_ids  = var.public_subnet_ids

  s3_bucket_name = var.s3_bucket_name

  databricks_host      = var.databricks_host
  databricks_password  = var.databricks_password
  databricks_schema    = var.databricks_schema
  databricks_port      = var.databricks_port
  databricks_http_path = var.databricks_http_path
  iglu_server_url      = var.iglu_server_url
  iglu_server_apikey   = var.iglu_server_apikey
}

Examples

For a complete example, see examples/complete
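As a quick reference, the input values can also be supplied via a terraform.tfvars file. A minimal sketch is shown below; every value is a placeholder and must be replaced with your own identifiers and credentials.

vpc_id             = "vpc-0123456789abcdef0"
private_subnet_ids = ["subnet-aaaa1111", "subnet-bbbb2222"]
public_subnet_ids  = ["subnet-cccc3333", "subnet-dddd4444"]

s3_bucket_name = "my-snowplow-transformed-events"

databricks_host      = "dbc-12345678-90ab.cloud.databricks.com"
databricks_password  = "dapi-placeholder-token"
databricks_schema    = "snowplow"
databricks_http_path = "/sql/1.0/warehouses/0123456789abcdef"

iglu_server_url    = "https://iglu.example.com"
iglu_server_apikey = "00000000-0000-0000-0000-000000000000"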

Requirements

Name Version
terraform >= 1.3.5
aws >= 3.45.0
tls >= 4.0.4
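If you want to pin these constraints explicitly in the calling configuration, a minimal terraform block along the following lines should satisfy them (assuming the standard hashicorp/aws and hashicorp/tls registry providers):

terraform {
  required_version = ">= 1.3.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.45.0"
    }
    tls = {
      source  = "hashicorp/tls"
      version = ">= 4.0.4"
    }
  }
}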

Providers

Name Version
aws >= 3.45.0
tls >= 4.0.4

Modules

Name Source Version
bad_1_stream snowplow-devops/kinesis-stream/aws 0.3.0
bad_2_stream snowplow-devops/kinesis-stream/aws 0.3.0
collector_kinesis snowplow-devops/collector-kinesis-ec2/aws 0.4.0
collector_lb snowplow-devops/alb/aws 0.2.0
databricks_loader ./modules/databricks_loader n/a
enrich_kinesis snowplow-devops/enrich-kinesis-ec2/aws 0.4.0
enriched_stream snowplow-devops/kinesis-stream/aws 0.3.0
raw_stream snowplow-devops/kinesis-stream/aws 0.3.0
s3_loader_bad snowplow-devops/s3-loader-kinesis-ec2/aws 0.3.1
s3_loader_enriched snowplow-devops/s3-loader-kinesis-ec2/aws 0.3.1
s3_loader_raw snowplow-devops/s3-loader-kinesis-ec2/aws 0.3.1
s3_pipeline_bucket snowplow-devops/s3-bucket/aws 0.2.0
transformer_kinesis snowplow-devops/transformer-kinesis-ec2/aws 0.2.2

Resources

Name Type
aws_key_pair.pipeline resource
aws_sqs_queue.message_queue resource
tls_private_key.tls_key resource

Inputs

Name Description Type Default Required
associate_public_ip_address Whether to assign a public IP address to the resources. Required if resources are created in a public subnet bool false no
databricks_host Databricks Host string n/a yes
databricks_http_path Databricks HTTP path string n/a yes
databricks_password Password for the databricks_loader_user used by the loader to perform loading string n/a yes
databricks_port Databricks port number 443 no
databricks_schema Databricks schema name string n/a yes
iam_permissions_boundary The permissions boundary ARN to set on IAM roles created string "" no
iglu_server_apikey Iglu Server API key string n/a yes
iglu_server_url Iglu Server url/dns string n/a yes
pipeline_kcl_write_max_capacity KCL DynamoDB table write capacity; increase this to sustain throughput at very high pipeline volumes number 10 no
private_subnet_ids The list of private subnets to deploy resources across list(string) n/a yes
public_subnet_ids The list of public subnets to deploy resources across list(string) n/a yes
s3_bucket_name Name of the S3 bucket for transformed Snowplow events string n/a yes
ssl_information The ARN of an Amazon Certificate Manager certificate to bind to the load balancer (see the example following this table) object({ enabled = bool, certificate_arn = string }) { "enabled": false, "certificate_arn": "" } no
tags The tags to append to the resources map(string) {} no
transformer_window_period_min Frequency, in minutes, at which the loading finished message is emitted (e.g. 5, 10, 15, 20, 30, 60) number 5 no
vpc_id The VPC to deploy resources within string n/a yes
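For example, to terminate TLS on the collector load balancer, pass an ssl_information object referencing an Amazon Certificate Manager certificate. A minimal sketch follows; the certificate ARN is a placeholder and the remaining required inputs are the same as in the Usage section.

module "snowplow-databricks-pipeline" {
  source = "Datomni/snowplow-databricks-pipeline/aws"

  # ... all required inputs as shown in the Usage section ...

  ssl_information = {
    enabled         = true
    certificate_arn = "arn:aws:acm:eu-west-1:123456789012:certificate/00000000-0000-0000-0000-000000000000"
  }
}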

Outputs

Name Description
collector_dns_name The ALB DNS name for the pipeline Collector
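The collector_dns_name output can be consumed by the calling configuration, for example to publish a friendly DNS name for the Collector endpoint. A sketch is shown below, assuming a Route 53 hosted zone managed outside this module; the zone ID and record name are placeholders.

resource "aws_route53_record" "collector" {
  zone_id = "Z0123456789ABCDEFGHIJ"
  name    = "collector.example.com"
  type    = "CNAME"
  ttl     = 300
  records = [module.snowplow-databricks-pipeline.collector_dns_name]
}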