terraform-aws-postgres-loader-kinesis-ec2

A Terraform module which deploys a Snowplow Postgres Loader application on AWS running on top of EC2. If you want to use a custom AMI for this deployment you will need to ensure it is based on top of Amazon Linux 2.

WARNING: If you are upgrading from module version 0.1.x you will need to issue a manual table update - details can be found here. You will need to adjust the alter table command with the schema that your events table is deployed within.

Telemetry

By default this module collects and forwards telemetry information to Snowplow so that we can understand how our applications are being used. No identifying information about your sub-account or account fingerprints is ever forwarded to us - only very simple information about which modules and applications are deployed and active.

If you wish to subscribe to our mailing list for updates to these modules or security advisories please set the user_provided_id variable to include a valid email address which we can reach you at.
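As a minimal sketch, the loader module from the Usage section could carry the identifier like this. The email address is a placeholder, and all other values are taken from the Usage examples rather than being requirements of this setting:

module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  name       = "postgres-loader-enriched-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  in_stream_name = module.enriched_stream.name
  purpose        = "ENRICHED_EVENTS"
  schema_name    = "atomic"

  ssh_key_name = "your-key-name"

  # Placeholder address - replace with an email we can reach you at
  user_provided_id = "you@example.com"

  db_sg_id    = module.pipeline_rds.sg_id
  db_host     = module.pipeline_rds.address
  db_port     = module.pipeline_rds.port
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password
}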

How do I disable it?

To disable telemetry simply set variable telemetry_enabled = false.
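For illustration, a loader deployment with telemetry switched off might look like the following. Only the telemetry_enabled line is relevant here; the remaining values mirror the Usage examples:

module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  name       = "postgres-loader-enriched-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  in_stream_name = module.enriched_stream.name
  purpose        = "ENRICHED_EVENTS"
  schema_name    = "atomic"

  ssh_key_name = "your-key-name"

  # Note: Stops this module from sending telemetry to Snowplow
  telemetry_enabled = false

  db_sg_id    = module.pipeline_rds.sg_id
  db_host     = module.pipeline_rds.address
  db_port     = module.pipeline_rds.port
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password
}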

What are you collecting?

For details on what information is collected please see this module: https://github.com/snowplow-devops/terraform-snowplow-telemetry

Usage

The Postgres Loader can load both your enriched and bad data into a Postgres database. By default we use RDS as it affords a simple and cost-effective way to get started.

To start loading "enriched" data into Postgres:

module "enriched_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.2.0"

  name = "enriched-stream"
}

module "pipeline_rds" {
  source  = "snowplow-devops/rds/aws"
  version = "0.2.0"

  name        = "pipeline-rds"
  vpc_id      = var.vpc_id
  subnet_ids  = var.subnet_ids
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password

  # Note: this exposes your data to the internet - take care to ensure your allowlist is strict enough
  #       or provide a way to access the database through the VPC instead
  publicly_accessible     = true
  additional_ip_allowlist = local.pipeline_ip_allowlist
}

module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  name       = "postgres-loader-enriched-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  in_stream_name = module.enriched_stream.name

  # Note: The purpose defines what the input data set should look like
  purpose = "ENRICHED_EVENTS"

  # Note: This schema is created automatically by the VM on launch
  schema_name = "atomic"

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]

  db_sg_id    = module.pipeline_rds.sg_id
  db_host     = module.pipeline_rds.address
  db_port     = module.pipeline_rds.port
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password
}

To load the "bad" data instead:

module "bad_1_stream" {
  source  = "snowplow-devops/kinesis-stream/aws"
  version = "0.2.0"

  name = "bad-1-stream"
}

module "postgres_loader_bad" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  name       = "postgres-loader-bad-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  in_stream_name = module.bad_1_stream.name

  # Note: The purpose defines what the input data set should look like
  purpose = "JSON"

  # Note: This schema is created automatically by the VM on launch
  schema_name = "atomic_bad"

  ssh_key_name     = "your-key-name"
  ssh_ip_allowlist = ["0.0.0.0/0"]

  # Linking in the custom Iglu Server here
  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = "http://your-iglu-server-endpoint/api"
      api_key         = var.iglu_super_api_key
      vendor_prefixes = []
    }
  ]

  db_sg_id    = module.pipeline_rds.sg_id
  db_host     = module.pipeline_rds.address
  db_port     = module.pipeline_rds.port
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password
}
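
The snippets above reference several variables and locals that are not defined in this README. One possible way to declare them is sketched below. The names match the ones used above, but the pipeline_db_password variable, the database name and username, and the allowlisted CIDR are illustrative assumptions to adapt to your own environment:

variable "vpc_id" {
  description = "The VPC to deploy the pipeline components into"
  type        = string
}

variable "subnet_ids" {
  description = "The subnets to deploy the pipeline components across"
  type        = list(string)
}

variable "iglu_super_api_key" {
  description = "API key for your Iglu Server"
  type        = string
  sensitive   = true
}

# Assumption: the password is supplied as a variable rather than hard-coded
variable "pipeline_db_password" {
  description = "Password for the pipeline database user"
  type        = string
  sensitive   = true
}

locals {
  # Illustrative values only
  pipeline_db_name     = "snowplow"
  pipeline_db_username = "snowplow"
  pipeline_db_password = var.pipeline_db_password

  # Example CIDR from a documentation range - replace with your own address
  pipeline_ip_allowlist = ["203.0.113.10/32"]
}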

Increasing RDS capacity

As you load data into the database it will naturally start to fill up. To handle this seamlessly you can enable storage auto-scaling for RDS by updating the module snippet as follows:

module "pipeline_rds" {
  source  = "snowplow-devops/rds/aws"
  version = "0.1.4"

  # Note: Enables storage autoscaling up to 100 GB from the default 10 GB
  max_allocated_storage = 100

  name        = "pipeline-rds"
  vpc_id      = var.vpc_id
  subnet_ids  = var.subnet_ids
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password

  # Note: this exposes your data to the internet - take care to ensure your allowlist is strict enough
  #       or provide a way to access the database through the VPC instead
  publicly_accessible     = true
  additional_ip_allowlist = local.pipeline_ip_allowlist
}

Requirements

Name	Version
terraform	>= 1.0.0
aws	>= 3.72.0

Providers

Name	Version
aws	>= 3.72.0

Modules

Name	Source	Version
instance_type_metrics	snowplow-devops/ec2-instance-type-metrics/aws	0.1.2
kcl_autoscaling	snowplow-devops/dynamodb-autoscaling/aws	0.2.0
service	snowplow-devops/service-ec2/aws	0.2.0
telemetry	snowplow-devops/telemetry/snowplow	0.4.0

Resources

Name	Type
aws_cloudwatch_log_group.log_group	resource
aws_dynamodb_table.kcl	resource
aws_iam_instance_profile.instance_profile	resource
aws_iam_policy.iam_policy	resource
aws_iam_role.iam_role	resource
aws_iam_role_policy_attachment.policy_attachment	resource
aws_security_group.sg	resource
aws_security_group_rule.egress_tcp_443	resource
aws_security_group_rule.egress_tcp_80	resource
aws_security_group_rule.egress_tcp_server_rds	resource
aws_security_group_rule.egress_udp_123	resource
aws_security_group_rule.ingress_tcp_22	resource
aws_security_group_rule.rds_egress_tcp_webserver	resource
aws_caller_identity.current	data source
aws_region.current	data source

Inputs

Name	Description	Type	Default	Required
db_host	The hostname of the database to connect to	`string`	n/a	yes
db_name	The name of the database to connect to	`string`	n/a	yes
db_password	The password to use to connect to the database	`string`	n/a	yes
db_port	The port the database is running on	`number`	n/a	yes
db_sg_id	The ID of the RDS security group that sits downstream of the webserver	`string`	n/a	yes
db_username	The username to use to connect to the database	`string`	n/a	yes
in_stream_name	The name of the input Kinesis stream that the Postgres Loader will pull data from	`string`	n/a	yes
name	A name which will be pre-pended to the resources created	`string`	n/a	yes
purpose	The type of data the loader will be pulling which can be one of ENRICHED_EVENTS or JSON (Note: JSON can be used for loading bad rows)	`string`	n/a	yes
schema_name	The database schema to load data into (e.g. atomic | atomic_bad)	`string`	n/a	yes
ssh_key_name	The name of the SSH key-pair to attach to all EC2 nodes deployed	`string`	n/a	yes
subnet_ids	The list of subnets to deploy the Postgres Loader across	`list(string)`	n/a	yes
vpc_id	The VPC to deploy the Postgres Loader within	`string`	n/a	yes
amazon_linux_2_ami_id	The AMI ID to use which must be based on Amazon Linux 2; by default the latest community version is used	`string`	`""`	no
associate_public_ip_address	Whether to assign a public ip address to this instance	`bool`	`true`	no
cloudwatch_logs_enabled	Whether application logs should be reported to CloudWatch	`bool`	`true`	no
cloudwatch_logs_retention_days	The length of time in days to retain logs for	`number`	`7`	no
custom_iglu_resolvers	The custom Iglu Resolvers that will be used by the Postgres Loader to resolve schemas	list(object({ name = string priority = number uri = string api_key = string vendor_prefixes = list(string) }))	`[]`	no
db_max_connections	The maximum number of connections to the backing database	`number`	`10`	no
default_iglu_resolvers	The default Iglu Resolvers that will be used by Enrichment to resolve and validate events	list(object({ name = string priority = number uri = string api_key = string vendor_prefixes = list(string) }))	[ { "api_key": "", "name": "Iglu Central", "priority": 10, "uri": "http://iglucentral.com", "vendor_prefixes": [] }, { "api_key": "", "name": "Iglu Central - Mirror 01", "priority": 20, "uri": "http://mirror01.iglucentral.com", "vendor_prefixes": [] } ]	no
enable_auto_scaling	Whether to enable auto-scaling policies for the service (WARN: ensure you have sufficient db_connections available for the max number of nodes in the ASG)	`bool`	`true`	no
iam_permissions_boundary	The permissions boundary ARN to set on IAM roles created	`string`	`""`	no
in_max_batch_size_checkpoint	The maximum number of events to process before checkpointing progress on the stream	`number`	`1000`	no
in_max_batch_wait_checkpoint	The maximum amount of time to wait before checkpointing progress on the stream	`string`	`"10 seconds"`	no
initial_position	Where to start processing the input Kinesis Stream from (TRIM_HORIZON or LATEST)	`string`	`"TRIM_HORIZON"`	no
instance_type	The instance type to use	`string`	`"t3a.micro"`	no
java_opts	Custom JAVA Options	`string`	`"-Dorg.slf4j.simpleLogger.defaultLogLevel=info -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=75"`	no
kcl_read_max_capacity	The maximum READ capacity for the KCL DynamoDB table	`number`	`10`	no
kcl_read_min_capacity	The minimum READ capacity for the KCL DynamoDB table	`number`	`1`	no
kcl_write_max_capacity	The maximum WRITE capacity for the KCL DynamoDB table	`number`	`10`	no
kcl_write_min_capacity	The minimum WRITE capacity for the KCL DynamoDB table	`number`	`1`	no
max_size	The maximum number of servers in this server-group	`number`	`2`	no
min_size	The minimum number of servers in this server-group	`number`	`1`	no
scale_down_cooldown_sec	Time (in seconds) until another scale-down action can occur	`number`	`600`	no
scale_down_cpu_threshold_percentage	The average CPU percentage that we must be below to scale-down	`number`	`20`	no
scale_down_eval_minutes	The number of consecutive minutes that we must be below the threshold to scale-down	`number`	`60`	no
scale_up_cooldown_sec	Time (in seconds) until another scale-up action can occur	`number`	`180`	no
scale_up_cpu_threshold_percentage	The average CPU percentage that must be exceeded to scale-up	`number`	`60`	no
scale_up_eval_minutes	The number of consecutive minutes that the threshold must be breached to scale-up	`number`	`5`	no
ssh_ip_allowlist	The list of CIDR ranges to allow SSH traffic from	`list(any)`	[ "0.0.0.0/0" ]	no
tags	The tags to append to this resource	`map(string)`	`{}`	no
telemetry_enabled	Whether or not to send telemetry information back to Snowplow Analytics Ltd	`bool`	`true`	no
user_provided_id	An optional unique identifier to identify the telemetry events emitted by this stack	`string`	`""`	no
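
The warning on enable_auto_scaling can be made concrete with a rough sizing sketch. It assumes, which the table above does not confirm, that db_max_connections applies to each loader instance, so the worst case is roughly max_size x db_max_connections connections against the database. The numbers are illustrative only:

module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws"

  name       = "postgres-loader-enriched-server"
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  in_stream_name = module.enriched_stream.name
  purpose        = "ENRICHED_EVENTS"
  schema_name    = "atomic"

  ssh_key_name = "your-key-name"

  # Assumption: up to 4 nodes x 10 connections each = 40 connections in the worst case,
  # so the database should allow at least that many on top of any other clients
  enable_auto_scaling = true
  min_size            = 1
  max_size            = 4
  db_max_connections  = 10

  db_sg_id    = module.pipeline_rds.sg_id
  db_host     = module.pipeline_rds.address
  db_port     = module.pipeline_rds.port
  db_name     = local.pipeline_db_name
  db_username = local.pipeline_db_username
  db_password = local.pipeline_db_password
}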

Outputs

Name	Description
asg_id	ID of the ASG
asg_name	Name of the ASG
sg_id	ID of the security group attached to the Postgres Loader servers
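
As a brief sketch, these outputs can be re-exported from the root configuration so that other tooling can look them up. The output names on the left are hypothetical, and the module name follows the enriched example in the Usage section:

output "postgres_loader_asg_name" {
  description = "Name of the ASG running the enriched Postgres Loader"
  value       = module.postgres_loader_enriched.asg_name
}

output "postgres_loader_sg_id" {
  description = "Security group attached to the enriched Postgres Loader servers"
  value       = module.postgres_loader_enriched.sg_id
}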

Copyright and license

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
