terraform-aws-vault-ha-raft

Hashicorp Vault HA cluster based on Raft Consensus Algorithm


The Vault HA cluster is based on the Raft Storage Backend, announced as a tech preview in Vault 1.2.0 (July 30th, 2019), introduced as a beta in 1.3.0 (November 14th, 2019), and promoted out of beta in 1.4.0 (April 7th, 2020).

The Raft storage backend is used to persist Vault's data. Unlike other storage backends, Raft storage does not operate from a single source of data. Instead, all the nodes in a Vault cluster have a replicated copy of Vault's data. Data gets replicated across all the nodes via the Raft Consensus Algorithm.

  • High Availability – the Raft storage backend supports high availability.
  • HashiCorp Supported – the Raft storage backend is officially supported by HashiCorp.

Key features:

  • Can run at a low cost or even entirely on the AWS Free Tier
  • No external dependencies such as Consul, etcd, or a database for storing data
  • No additional provisioning tools such as Ansible, Chef, or Puppet are needed; everything is plain Terraform
  • The module is fully self-contained with zero required external resources; optional ones like ACM for HTTPS can be attached
  • Provisioning is based on CoreOS Ignition, so it is fast, declarative, and predictable
  • Manual snapshots (backups) can be created quickly and easily from the Vault UI (thanks to the Raft implementation)
  • Integrated automatic data backups via Amazon EBS snapshots with a fully configurable scheduler (disabled by default)
  • Integrated optional auto-unseal with built-in AWS KMS provisioning, an external AWS KMS key, or the Transit secrets backend of another Vault
  • Vault Raft data is stored on separate EBS volumes, independent of the root filesystem
  • Easily increase or decrease the number of nodes without losing data just by running terraform apply (with some downtime)
  • The cluster can run from just one node up to N nodes, distributed proportionally across the availability zones of a region
  • No data is lost even if all instances are terminated by mistake (just redeploy with Terraform to restore the cluster)
  • A Vault version can be upgraded or downgraded across the whole cluster without losing data (with some downtime)
  • Communication between peers (nodes) is encrypted with TLS 1.2+ and certificate-based client authentication using RSA-2048* (bidirectional TLS encryption and authentication)
  • Communication between the cluster and the ALB (Load Balancer) is encrypted with TLS 1.2+
  • A free Amazon certificate (ACM) can be assigned to the ALB for client-server encryption
  • By default, all nodes are hidden in a private subnet and only one port on the ALB is accessible from outside (AWS security best practice)
  • Optional generation of an SSH key pair (RSA-4096*) assigned to the nodes (not recommended; provisioning an external SSH public key is better)
  • SSH access to instances can also be provided by provisioning a Root CA* certificate and principals
  • An external (your own) Root CA* can be assigned for cases when Vault needs to communicate securely with internal infrastructure
  • For better security, disable_mlock is kept at false by default (the mlock syscall stays enabled)
  • Optionally, a public network can be assigned to all nodes; the SSH and HTTPS ports will then be publicly available (for debugging and development only)
  • Debugging options with a full dump of all configuration files and certificates the cluster is deployed with (for debugging and development only)

* See the Limitations section below for some restrictions regarding AWS provisioning
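
As a minimal sketch, several of these optional features can be toggled directly on the module block (values here are illustrative; the input names are documented in the Inputs section below):

module "vault_ha" {
  source = "github.com/binlab/terraform-aws-vault-ha-raft?ref=v0.1.0"

  cluster_name = "vault-ha"

  # Built-in AWS KMS auto-unseal provisioned by the module
  autounseal = true

  # Scheduled EBS snapshots of the Raft data volumes (disabled by default)
  aws_snapshots        = true
  aws_snapshots_retain = 7

  # Provision an external SSH public key instead of generating a key pair
  ssh_authorized_keys = ["ssh-rsa AAAA... user@example.com"] # placeholder key
}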

Why?

Why not use Kubernetes or another existing cluster solution? A few reasons:

  1. Independence. To create infrastructure as code with Terraform (for example, this very cluster) we need somewhere to store secret input parameters (passwords, IPs, private data) and outputs (tokens, endpoints, passwords). Vault is very convenient for this.
  2. Stability. A cluster manager is an additional layer of abstraction on top of the EC2 instances; using native deployment methods is much better.
  3. Security. Vault may store very secret and sensitive data. Putting this data next to publicly available services carries a potential risk of leaks. With this module, the cluster can be deployed completely independently, even in a separate AWS account to which access is limited to a few people.
  4. Lightweight. Sometimes we need a very lightweight, cheap, and at the same time very stable Vault, e.g. just for auto-unsealing another Vault.
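
To illustrate point 1, here is a minimal sketch of reading Terraform inputs from such a Vault, assuming the Vault Terraform provider is pointed at this cluster and a hypothetical KV secret at secret/infra/db exists:

provider "vault" {
  address = "https://vault.example.com:443" # hypothetical cluster URL
}

# Read secret input parameters stored in Vault (hypothetical KV path)
data "vault_generic_secret" "db" {
  path = "secret/infra/db"
}

# Expose the value to other resources or modules as needed
output "db_username" {
  value     = data.vault_generic_secret.db.data["username"]
  sensitive = true
}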

AWS Permissions

Deployment requires a set of AWS permissions. For beginners it may be difficult to set up the minimal required permissions, so here is a wildcard list for the main actions. Professionals, or those interested in high-level security and granular permissions, should look at AWS IAM Granular Permissions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VaultHAProvisioning",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "DLM:*",
        "elasticloadbalancing:*",
        "iam:*",
        "kms:*",
        "route53:*",
        "sts:GetCallerIdentity"
      ],
      "Resource": "*"
    }
  ]
}
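
If you prefer to manage this policy with Terraform as well, a minimal sketch (the policy name and file path are assumptions; the JSON document above is saved to a local file):

resource "aws_iam_policy" "vault_ha_provisioning" {
  name   = "VaultHAProvisioning"                        # hypothetical policy name
  policy = file("${path.module}/vault-ha-policy.json")  # the JSON document above
}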

Usage

IMPORTANT: The latest code from master may require temporarily enabling the option nat_enabled (access to external resources) at the first initialization, since during creation of the cluster the instances need to pull a Docker image. An alternative is placing the cluster in a public subnet.

The module can be deployed with almost all variables at their default values. For details about the defaults, see the Inputs section below.

provider "aws" {
  region = "us-east-1"
}

module "vault_ha" {
  source = "github.com/binlab/terraform-aws-vault-ha-raft?ref=v0.1.0"

  cluster_name        = "vault-ha"
  node_instance_type  = "t3a.small"
  autounseal          = true
}

output "cluster_url" {
  value = module.vault_ha.cluster_url
}

Then run:

$ terraform init
$ terraform apply

After the deployment process completes you should see:

...
cluster_url = http://tf-vault-ha-alb-123456789.us-east-1.elb.amazonaws.com:443
$

Then just open the URL in a browser and initialize the cluster.

ATTENTION! Some resources are not covered by the AWS Free Tier and will cost money, so after running this example you should destroy all previously created resources with the following command:

$ terraform destroy

Examples

  1. Assigning a CNAME and a Route 53 Alias to the Vault HA cluster (see the sketch after this list)
  2. Assigning the module's VPC to external resources, e.g. a Bastion host
  3. VPC peering of different networks, e.g. an RDS database
  4. Deploying the Vault cluster inside an already created (external) AWS VPC
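
For case 1, a minimal sketch based on the documented module outputs alb_dns_name and alb_zone_id, assuming the module block from the Usage section above (the hosted zone ID and domain are placeholders):

resource "aws_route53_record" "vault" {
  zone_id = "Z0123456789ABCDEFGHIJ" # placeholder hosted zone ID
  name    = "vault.example.com"     # placeholder domain
  type    = "A"

  alias {
    name                   = module.vault_ha.alb_dns_name
    zone_id                = module.vault_ha.alb_zone_id
    evaluate_target_health = false
  }
}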

TODO

  • Add examples of use for different cases #10
  • Host the module on the Terraform Registry #13
  • Add validation of input data in variables.tf
  • Add support for Fedora CoreOS, as it was announced that CoreOS Container Linux will reach its end of life on May 26, 2020 and will no longer receive updates
  • Remove the external dependency - VPC Module - #7
  • Add support for an external AWS VPC - #4
  • Add an option to configure preferred AWS availability zones - #36
  • Add configuration for an external Vault Audit Device via syslog or socket
  • Third-party plugins installation support
  • Add an optional open HTTP port on the ALB and set up a redirect from HTTP to HTTPS. Canonical support
  • Disable the NAT Gateway by default (to reduce costs and improve security) - #27
  • Add an option to disable the Route 53 internal zone (to reduce costs)
  • Add EFS storage support as persistent Raft data storage
  • Add an option to disable creating an additional EFS (to reduce costs)
  • Add an option to store Raft data in temporary memory (RAM) - paranoid mode
  • Implement provisioning of an internal Intermediate CA* for signing node certificates
  • Implement scheduled data backups via an embedded snapshot operator, storing them to an S3 bucket (to reduce costs)
  • Replace creating standalone EC2 instances with an autoscaling group (might bring some limitations)
  • Auto-provision the cluster on first installation, storing the Token and Unseal keys via GPG/PGP or Keybase
  • Add support for an OpenStack (OS) Terraform module
  • Add support for a Google Cloud Platform (GCP) Terraform module
  • Add support for a Microsoft Azure Terraform module
  • Add support for an AliCloud Terraform module
  • Add support for an Oracle Cloud (OCI) Terraform module
  • Add support for Docker by Terraform (for local development and testing)
  • Multi-region cluster support for super high availability
  • Auto-delete nodes from the cluster when decreasing the node count

* See the Limitations section below for some restrictions regarding AWS provisioning

Limitations

  • Because AWS strictly limits the size of the User Data file, we cannot put very large certificates and keys into the Ignition file.

    User data is limited to 16 KB, in raw form, before it is base64-encoded. The size of a string of length n after base64-encoding is ceil(n/3)*4. source

    So in this case the size needs to be selected experimentally. To know the exact size of the file, you can use debug mode.

  • The Requirements block and versions.tf may not accurately reflect the real minimum versions of the providers. The declared versions are simply those installed at the time the module was developed and tested, and working is only guaranteed with those or higher versions. If for some reason you use older provider versions and can confirm that they work, please create an issue so a version constraint can be lowered to the minimum needed.

  • According to the open issue, be careful with the Tags settings: any change made after the tags are created may trigger a change of value. Until the problem is closed by the Terraform team, a temporary workaround is applied, and it is best to determine the tag names in advance (see the sketch below).
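
For example, a minimal sketch of pinning the tag names in advance via the tags input (the tag keys and values here are illustrative):

module "vault_ha" {
  source = "github.com/binlab/terraform-aws-vault-ha-raft?ref=v0.1.0"

  cluster_name = "vault-ha"

  # Decide on the tag keys up front; changing them later may trigger updates
  tags = {
    Environment = "dev"      # illustrative values
    Project     = "vault-ha"
  }
}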

Requirements

Name Version
terraform >= 0.12
aws >= 2.53.0
ignition >= 1.2.1
local >= 1.4.0
tls >= 2.1.1

Providers

Name Version
aws >= 2.53.0
ignition >= 1.2.1
local >= 1.4.0
tls >= 2.1.1

Modules

No modules.

Resources

Name Type
aws_dlm_lifecycle_policy.snapshots resource
aws_ebs_volume.data resource
aws_eip.nat resource
aws_iam_instance_profile.autounseal resource
aws_iam_role.autounseal resource
aws_iam_role.snapshots resource
aws_iam_role_policy.autounseal resource
aws_iam_role_policy.snapshots resource
aws_instance.node resource
aws_internet_gateway.public resource
aws_kms_key.autounseal resource
aws_lb.cluster resource
aws_lb_listener.cluster resource
aws_lb_target_group.cluster resource
aws_lb_target_group_attachment.cluster resource
aws_nat_gateway.private resource
aws_route.private resource
aws_route.public resource
aws_route53_record.ext resource
aws_route53_record.int resource
aws_route53_zone.int resource
aws_route_table.private resource
aws_route_table.public resource
aws_route_table_association.private resource
aws_route_table_association.public resource
aws_security_group.alb resource
aws_security_group.node resource
aws_security_group.public resource
aws_security_group.vpc resource
aws_subnet.private resource
aws_subnet.public resource
aws_volume_attachment.node resource
aws_vpc.this resource
local_file.ca_cert resource
local_file.config resource
local_file.node_cert resource
local_file.node_key resource
local_file.ssh_private_key resource
local_file.user_data resource
tls_cert_request.node resource
tls_locally_signed_cert.node resource
tls_private_key.ca resource
tls_private_key.core resource
tls_private_key.node resource
tls_self_signed_cert.ca resource
aws_ami.coreos data source
aws_ami.flatcar data source
aws_availability_zones.current data source
aws_iam_policy_document.autounseal data source
aws_iam_policy_document.autounseal_sts data source
aws_iam_policy_document.snapshots data source
aws_iam_policy_document.snapshots_sts data source
ignition_config.node data source
ignition_file.auth_principals_admin data source
ignition_file.auth_principals_core data source
ignition_file.ca_ssh_public_keys data source
ignition_file.ca_tls_public_keys data source
ignition_file.config data source
ignition_file.helper data source
ignition_file.node_ca data source
ignition_file.node_cert data source
ignition_file.node_key data source
ignition_file.sshd_config data source
ignition_file.update_config data source
ignition_filesystem.data data source
ignition_systemd_unit.mount data source
ignition_systemd_unit.service data source
ignition_user.admin data source
ignition_user.core data source

Inputs

Name Description Type Default Required
ami_channel AMI filter for OS channel [stable/edge/beta/etc] string "stable" no
ami_image Specific AMI image ID in the current Availability Zone, e.g. [ami-123456].
If provided, nodes will run on it (useful when the image is built by
Packer); setting it disables the image search by "ami_vendor" and
"ami_channel". Note: the instance OS should support CoreOS Ignition
provisioning
string "" no
ami_vendor AMI filter for OS vendor [coreos/flatcar] string "flatcar" no
autounseal Option to enable/disable creating a KMS key, IAM role, policy and
AssumeRole for auto-unseal by AWS. Instead of being created by the
module, external resources can be used for auto-unseal, or it can be
skipped entirely. If set, "seal_transit" and "seal_awskms" are disabled.
bool false no
aws_snapshots Option to enable/disable embedded snapshots by AWS bool false no
aws_snapshots_interval Snapshot Interval. How often this lifecycle policy
should be evaluated. 2,3,4,6,8,12 or 24 are valid values
number 24 no
aws_snapshots_retain How many snapshots to keep. Must be an integer between 1 and 1000 number 7 no
aws_snapshots_time A list of times in 24 hour clock format that sets when the
lifecycle policy should be evaluated. Max of 1 by UTC time
string "23:45" no
ca_ssh_public_keys List of SSH Certificate Authority public keys. Specifies the public
keys of certificate authorities that are trusted to sign
user certificates for authentication. More:
https://man.openbsd.org/sshd_config#TrustedUserCAKeys
list(string) [] no
ca_tls_public_keys List of custom Certificate Authority public keys. Used when Vault
needs to connect to resources with a self-signed certificate
list(string) [] no
certificate_arn ARN of the AWS certificate assigned to the ALB to terminate TLS
connections. It should be a certificate issued for a domain that
will be assigned as a CNAME record to the ALB endpoint. If not set,
TLS will not be activated on the ALB. More:
https://www.terraform.io/docs/providers/aws/r/\
acm_certificate_validation.html#certificate_arn
string "" no
cluster_allowed_subnets Allowed IPs to connect to a cluster on ALB endpoint list(string)
[
"0.0.0.0/0"
]
no
cluster_count Count of nodes in cluster across all availability zones number 3 no
cluster_description Description for Tags in all resources.
Also used as a prefix for certificates "common_name",
"organizational_unit" and "organization" fields
string "Hashicorp Vault HA Cluster" no
cluster_domain Public cluster domain that will be assigned as CNAME record to
ALB endpoint. If not set ALB endpoint will be used
string "" no
cluster_name Name of a cluster, and tag "Name", can be a project name.
Format of "Name" tag "<cluster_prefix>-<cluster_name>-"
string "vault-ha" no
cluster_port External port on ALB endpoint to a public connection number 443 no
cluster_prefix Prefix of a tag "Name", can be a namespace.
Format of "Name" tag "<cluster_prefix>-<cluster_name>-"
string "tf-" no
create_route53_external Creating external route53 record bool false no
data_volume_size Data (Raft) volume block device Size (GB) e.g. [8] number 8 no
data_volume_type Data (Raft) volume block device Type e.g. [gp2] string "gp2" no
debug Option for enabling debug output to plain files. When "true"
Terraform will store certificates, keys, ignitions files
(user data) JSON file to a folder "debug_path"
bool false no
debug_path Path to folder where will be stored debug files.
If is empty then default "${path.module}/.debug"
you can set custom full path e.g. "/home/user/.debug"
string "" no
disable_mlock Disables the server from executing the "mlock" syscall. Mlock
prevents memory from being swapped to disk. Disabling "mlock" is
not recommended in production, but is fine for local development
and testing
bool false no
docker_repo Vault Docker repository URI string "docker://vault" no
docker_tag Vault Docker image version tag string "1.7.3" no
internal_zone Name of the internal domain zone. Needed for assigning domain names
to each of the nodes for cluster server-to-server communication.
Also used for SSH connections over a Bastion host.
string "vault.int" no
internet_gateway_id_external Provide existing external internet gateway ID for AWS VPC string null no
nat_enabled Determines whether to create a NAT gateway and assign it to the
VPC Private Subnet. If you intend to use Vault only with internal
resources and an internal network, you can disable this option;
otherwise, you need to enable it. Allowing external routing might be
a potential security vulnerability. Also, enabling this option incurs
additional costs and is not covered by the AWS Free Tier program.
IMPORTANT: since the instances need to pull a Docker image during
cluster creation, nat_enabled must be enabled at the first
initialization
bool false no
node_allow_public Assign public network to nodes (EC2 Instances). EC2 will be
available publicly with HTTPS "node_port" ports and SSH "ssh_port".
For debugging only, don't use on production!
bool false no
node_allowed_subnets If variable "node_allow_public" is set to "true" - list of these
IPs will be allowed to connect to Vault node directly (to instances)
list(string)
[
"0.0.0.0/32"
]
no
node_cert_hours_valid The number of hours after initial issuing after which the Vault
node certificate becomes invalid. The certificate is used for
internal communication between cluster peers and for connections
from the ALB. Setting a small value is not recommended, as there is
no reissuance mechanism other than re-applying Terraform
number 43800 no
node_cpu_credits The credit option for CPU usage [unlimited/standard] string "standard" no
node_instance_type Type of instance e.g. [t3.small] string "t3.small" no
node_monitoring CloudWatch detailed monitoring [true/false] bool false no
node_name_tmpl Template of Vault node ID for a Raft cluster. Also used as a
subdomain prefix for internal domains for example:
"node0.vault.int", "node1.vault.int", etc
string "node%d" no
node_port Port on which Vault listens for ALB and health check requests number 8200 no
node_volume_size Node (Root) volume block device Size (GB) e.g. [8] number 8 no
node_volume_type Node (Root) volume block Device Type e.g. [gp2] string "gp2" no
peer_port Port on which Vault listens for server-to-server cluster requests number 8201 no
route53_zone_id_external External route53 zone ID string "" no
seal_awskms Map of settings for Vault to use AWS KMS as the seal
wrapping mechanism. If set, "seal_transit" is disabled.
More: https://www.vaultproject.io/docs/configuration/seal/awskms
map(any) {} no
seal_transit Map of Transit seal settings for using Vault's Transit
Secrets Engine as the auto-unseal mechanism.
More: https://www.vaultproject.io/docs/configuration/seal/transit
map(any) {} no
ssh_admin_principals List of SSH authorized principals for user "Core" when SSH login
is configured via a Certificate Authority ("ca_ssh_public_key" is set).
More: https://man.openbsd.org/sshd_config#AuthorizedPrincipalsFile
list(string)
[
"vault-ha"
]
no
ssh_allowed_subnets If variable "node_allow_public" is set to "true" - list of these
IPs will be allowed to connect to Vault node by SSH directly (to
instances)
list(string)
[
"0.0.0.0/32"
]
no
ssh_authorized_keys List of SSH authorized keys assigned to "Core" user (sudo user) list(string) [] no
ssh_core_principals List of SSH authorized principals for user "Admin" when SSH login
is configured via a Certificate Authority ("ca_ssh_public_key" is set).
More: https://man.openbsd.org/sshd_config#AuthorizedPrincipalsFile
list(string)
[
"sudo"
]
no
ssh_port Listening SSH port on instances in public and private networks.
A non-default value is used only when "ca_ssh_public_key" is set;
otherwise it equals the default 22
number 22 no
tags Map of tags assigned to each of the created resources in AWS.
By default, the predefined map described in the file "locals.tf" is
used. Each of its entries can be overridden here separately.
map(string) {} no
vault_ui Enables the built-in Vault web UI bool true no
vpc_cidr VPC CIDR associated with a module. Block sizes must be between a
/16 netmask and /28 netmask for AWS. For example:
10.0.0.0/16-10.0.0.0/28,
172.16.0.0/16-172.16.0.0/28,
192.168.0.0/16-192.168.0.0/28
string "192.168.0.0/16" no
vpc_id_external Provide an existing external AWS VPC ID. If set, configure the
corresponding vpc_public_subnet_cidr and vpc_private_subnet_cidr to
match the external VPC CIDR
string null no
vpc_private_subnet_cidr CIDR block for the private subnet; must be in canonical form, in the
same network as the VPC, and non-overlapping with other subnets. For
example, a /25 subnet (e.g. 172.31.31.0/25) can contain up to 8
subnets with a /28 mask (the subnet mask must be not less than /28
for AWS)
string null no
vpc_private_subnet_mask Size of the private subnet. The subnet mask must be not less than /28
for AWS. A /28 mask can contain up to 16 IP addresses, but AWS reserves
5 addresses, so 11 are available to the user. More:
https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html
number 28 no
vpc_private_subnet_tmpl VPC Private Subnet Template. Created for the convenience of users
who are not very familiar with networks and subnetting.
Each index from the list of availability zones will be substituted
for the placeholder %d. Ignored if the variable
vpc_private_subnets is defined.
DEPRECATED: Try to avoid using this configuration; it might be removed
in future versions. To avoid re-creating the cluster, just describe
your existing networks in the vpc_private_subnets parameter list,
for example:
["192.168.101.0/24", "192.168.102.0/24", "192.168.103.0/24", ...]
string "192.168.10%d.0/24" no
vpc_private_subnets List of VPC Private Subnets. Each subnet will be assigned to an
availability zone in order.
The mask must be not less than /28 for AWS. Subnets must not overlap
and must be in the same network as vpc_cidr
list(string) [] no
vpc_public_subnet_cidr CIDR block for the public subnet; must be in canonical form, in the
same network as the VPC, and non-overlapping with other subnets. For
example, a /25 subnet (e.g. 172.31.31.0/25) can contain up to 8
subnets with a /28 mask (the subnet mask must be not less than /28
for AWS)
string null no
vpc_public_subnet_mask Size of the public subnet. The subnet mask must be not less than /28
for AWS. A /28 mask can contain up to 16 IP addresses, but AWS reserves
5 addresses, so 11 are available to the user. More:
https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html
number 28 no
vpc_public_subnet_tmpl VPC Public Subnet Template. Created for the convenience of users
who are not very familiar with networks and subnetting.
Each index from the list of availability zones will be substituted
for the placeholder %d. Ignored if the variable
vpc_public_subnets is defined.
DEPRECATED: Try to avoid using this configuration; it might be removed
in future versions. To avoid re-creating the cluster, just describe
your existing networks in the vpc_public_subnets parameter list,
for example:
["192.168.1.0/24", "192.168.2.0/24", "192.168.3.0/24", ...]
string "192.168.%d.0/24" no
vpc_public_subnets List of VPC Public Subnets. Each subnet will be assigned to an
availability zone in order.
The mask must be not less than /28 for AWS. Subnets must not overlap
and must be in the same network as vpc_cidr
list(string) [] no

Outputs

Name Description
alb_dns_name ALB external endpoint DNS name. Should be used to assign a
"CNAME" record of a public domain
alb_zone_id Canonical hosted zone ID of the ALB (load balancer).
Should be used to assign a Route 53 "Alias" record (AWS only).
cluster_url Cluster public URL with scheme, domain, and port.
All parameters depend on input values and are calculated automatically
for convenience. Can be constructed separately outside the module
igw_public_ips List of Internet public IPs. If the cluster nodes are placed
in the public subnet (Internet Gateway used) all external network
requests will go via the public IPs assigned to the nodes. This list
can be used for configuring security groups of related services or
for connecting to the nodes via SSH when debugging
nat_public_ips NAT public IPs assigned as the external IPs for requests from
each of the nodes. Convenient for restricting applications,
audit logs, some security groups, or other IP-based security
policies. Note: if "node_allow_public" is set, each node will get
its own public IP which will be used for external requests.
If var.nat_enabled is set to false, returns an empty list.
node_security_group Node Security Group ID which allows connections from the VPC and
ALB security groups
private_subnets List of Private Subnet IDs created in the module and associated with it.
Under the hood a "NAT Gateway" is used for external connections via the
route "0.0.0.0/0". When the variable "node_allow_public" = false, this
network is assigned to the instances. Otherwise, it is useful for
assigning other resources in this VPC, for example a database which can
work behind a NAT (or without NAT and external connections at all,
for security reasons) and does not need to be exposed publicly by its own IP.
public_subnets List of Public Subnet IDs created in the module and associated with it.
Under the hood an "Internet Gateway" is used for external connections
via the route "0.0.0.0/0". When the variable "node_allow_public" = true,
this network is assigned to the instances. Otherwise, it is useful for
assigning other resources in this VPC, for example a Bastion host which
needs to be exposed publicly by its own IP and not behind a NAT.
route_table Route Table ID assigned to the current Vault HA cluster subnet.
Depends on which subnetwork (Private or Public) is assigned to the instances.
ssh_private_key SSH private key generated by the module, whose public key
part is assigned to each of the nodes. This is not recommended, as
the private key will be kept in the open and stored in a state file.
Instead, set the variable "ssh_authorized_keys". Please note that
if "ssh_authorized_keys" is set, "ssh_private_key" returns an empty output
vpc_id VPC ID created in the module and associated with it. Needs to be exposed
for assigning other resources to the same VPC or for configuring
peering connections. If vpc_id_external is configured, it is returned instead
vpc_security_group VPC Security Group ID which allows connections to "cluster_port",
"node_port" and "ssh_port". Useful for debugging when a Bastion host
is connected to the same VPC
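
For example, a minimal sketch of allowing SSH from a Bastion host to the nodes using the node_security_group and vpc_id outputs, assuming the module block from the Usage section above (the Bastion security group here is an assumption):

resource "aws_security_group" "bastion" {
  name   = "bastion" # hypothetical Bastion security group
  vpc_id = module.vault_ha.vpc_id
}

resource "aws_security_group_rule" "bastion_to_nodes_ssh" {
  type                     = "ingress"
  from_port                = 22
  to_port                  = 22
  protocol                 = "tcp"
  security_group_id        = module.vault_ha.node_security_group
  source_security_group_id = aws_security_group.bastion.id
}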