A Terraform module that deploys a pipeline to load Snowplow data into Databricks using the Snowplow open-source artefacts. This module deploys the Collector application, the Enrich application, and the Databricks Loader.

For more details on the Snowplow pipeline, please visit the official documentation:
https://docs.snowplow.io/docs/understanding-your-pipeline/architecture-overview-aws/

Databricks Loader-specific details and prerequisites are documented here:
## Usage

Import the module and provide the required configuration variables:

```hcl
module "snowplow-databricks-pipeline" {
  source = "Datomni/snowplow-databricks-pipeline/aws"

  vpc_id             = var.vpc_id
  private_subnet_ids = var.private_subnet_ids
  public_subnet_ids  = var.public_subnet_ids

  s3_bucket_name = var.s3_bucket_name

  databricks_host      = var.databricks_host
  databricks_password  = var.databricks_password
  databricks_schema    = var.databricks_schema
  databricks_port      = var.databricks_port
  databricks_http_path = var.databricks_http_path

  iglu_server_url    = var.iglu_server_url
  iglu_server_apikey = var.iglu_server_apikey
}
```

For a complete example, see `examples/complete`.
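The usage example passes secrets such as `databricks_password` and `iglu_server_apikey` through root-module variables; marking those variables as sensitive keeps their values out of plan and apply output. A minimal sketch of the root-module declarations, assuming the variable names from the example above (these declarations are not part of this module itself):

```hcl
# Hypothetical root-module variable declarations for the secret inputs.
variable "databricks_password" {
  description = "Password for the Databricks loader user"
  type        = string
  sensitive   = true # redacted in `terraform plan` / `terraform apply` output
}

variable "iglu_server_apikey" {
  description = "Iglu Server API key"
  type        = string
  sensitive   = true
}
```

Values can then be supplied via a `.tfvars` file kept out of version control or via `TF_VAR_*` environment variables.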
## Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3.5 |
| aws | >= 3.45.0 |
| tls | >= 4.0.4 |
## Providers

| Name | Version |
|------|---------|
| aws | >= 3.45.0 |
| tls | >= 4.0.4 |
## Modules

| Name | Source | Version |
|------|--------|---------|
| bad_1_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
| bad_2_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
| collector_kinesis | snowplow-devops/collector-kinesis-ec2/aws | 0.4.0 |
| collector_lb | snowplow-devops/alb/aws | 0.2.0 |
| databricks_loader | ./modules/databricks_loader | n/a |
| enrich_kinesis | snowplow-devops/enrich-kinesis-ec2/aws | 0.4.0 |
| enriched_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
| raw_stream | snowplow-devops/kinesis-stream/aws | 0.3.0 |
| s3_loader_bad | snowplow-devops/s3-loader-kinesis-ec2/aws | 0.3.1 |
| s3_loader_enriched | snowplow-devops/s3-loader-kinesis-ec2/aws | 0.3.1 |
| s3_loader_raw | snowplow-devops/s3-loader-kinesis-ec2/aws | 0.3.1 |
| s3_pipeline_bucket | snowplow-devops/s3-bucket/aws | 0.2.0 |
| transformer_kinesis | snowplow-devops/transformer-kinesis-ec2/aws | 0.2.2 |
## Resources

| Name | Type |
|------|------|
| aws_key_pair.pipeline | resource |
| aws_sqs_queue.message_queue | resource |
| tls_private_key.tls_key | resource |
## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| associate_public_ip_address | Whether to assign a public IP address to the resource. Required if resources are created in a public subnet | bool | false | no |
| databricks_host | Databricks host | string | n/a | yes |
| databricks_http_path | Databricks HTTP path | string | n/a | yes |
| databricks_password | Password for the databricks_loader_user used by the loader to perform loading | string | n/a | yes |
| databricks_port | Databricks port | number | 443 | no |
| databricks_schema | Databricks schema name | string | n/a | yes |
| iam_permissions_boundary | The permissions boundary ARN to set on IAM roles created | string | "" | no |
| iglu_server_apikey | Iglu Server API key | string | n/a | yes |
| iglu_server_url | Iglu Server URL/DNS | string | n/a | yes |
| pipeline_kcl_write_max_capacity | Increasing this is important to increase throughput at very high pipeline volumes | number | 10 | no |
| private_subnet_ids | The list of private subnets to deploy resources across | list(string) | n/a | yes |
| public_subnet_ids | The list of public subnets to deploy resources across | list(string) | n/a | yes |
| s3_bucket_name | S3 bucket for transformed Snowplow events | string | n/a | yes |
| ssl_information | The ARN of an Amazon Certificate Manager certificate to bind to the load balancer | object({…}) | {…} | no |
| tags | The tags to append to the resources | map(string) | {} | no |
| transformer_window_period_min | Frequency to emit the loading-finished message (e.g. 5, 10, 15, 20, 30, 60 minutes) | number | 5 | no |
| vpc_id | The VPC to deploy resources within | string | n/a | yes |
## Outputs

| Name | Description |
|------|-------------|
| collector_dns_name | The ALB DNS name for the pipeline Collector |
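The `collector_dns_name` output can be used to put a friendly domain in front of the Collector's load balancer. A minimal sketch, assuming a Route 53 hosted zone managed in the root module (`var.hosted_zone_id` and the `collector.example.com` name are placeholders, not part of this module):

```hcl
# Hypothetical: point a custom collector domain at the pipeline's ALB.
resource "aws_route53_record" "collector" {
  zone_id = var.hosted_zone_id # assumed to be declared in the root module
  name    = "collector.example.com"
  type    = "CNAME"
  ttl     = 300
  records = [module.snowplow-databricks-pipeline.collector_dns_name]
}
```

If the `ssl_information` input is used to bind an ACM certificate to the load balancer, the certificate's domain should match the record created here.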