NOTE: This setup does not ship ready out-of-the-box; there are some tweaks the end user needs to make.
This repo assumes you're running on Linux or macOS.
Ensure you have the following installed: Terraform 0.13 (minimum), Packer 1.5 (minimum), Vault 1.4 (minimum), and Ansible 2.8 (minimum). The configs should work with newer versions as well. This repo assumes some prior knowledge of and experience with Vault, Terraform, Packer, and at least one of the cloud providers mentioned.
- Replace any instances of `example.com` in `tls-bootstrap/bootstrap.sh` with a domain that you control.
- Replace any instances of `example.com` in the `ansible/group_vars/example.yml` file with the same domain as the previous step.
- `cd` into the `ansible` folder and run the following to create Consul tokens for the necessary use cases:
  ```
  uuidgen | tr '[:upper:]' '[:lower:]' > roles/consul/files/tokens/agent
  uuidgen | tr '[:upper:]' '[:lower:]' > roles/consul/files/tokens/haproxy
  uuidgen | tr '[:upper:]' '[:lower:]' > roles/consul/files/tokens/vault
  ```
- `cd` into the `tls-bootstrap` folder and run `./bootstrap.sh`.
Pick a cloud provider - AWS, GCP, or Azure - and set up the infrastructure as follows. It's recommended to have a console for your chosen provider open and logged in.
- Ensure you have a default VPC that has a minimum of 2 subnets and outbound internet access. If you don't have one or don't want one, adjust the Packer templates and Terraform to use a different pre-configured VPC.
- Adjust the Packer and Terraform variables.
  - Check the `aws/packer/example.vars` file and supply credentials and a region.
  - Check the `aws/terraform/example.tfvars` file and adjust the hostnames and trusted external IPs to fit your setup. It's recommended to add the outbound IP of your machine to this list. NOTE: AWS requires that these addresses be in CIDR format.
- Build the images with Packer. `cd` into the `aws/packer` folder and run:
  ```
  packer build --var-file example.vars consul.json
  packer build --var-file example.vars vault.json
  ```
- Run the Terraform. `cd` into the `aws/terraform` folder, run `terraform init ; terraform apply --var-file example.tfvars` and, if the plan looks good, approve it.
- While Terraform is running, set up the domain configured in `aws/terraform/example.tfvars` to point at the AWS ELB. This can be done with a CNAME record in your DNS zone, or by resolving the DNS record (`dig <hostname>`) and editing the hosts file as follows. If the `dig` command doesn't produce IPs for the ELB, ensure it has finished provisioning and retry.
  ```
  <ip_1> vault-0.vault.example.com vault-1.vault.example.com vault.example.com consul.example.com
  <ip_2> vault-0.vault.example.com vault-1.vault.example.com vault.example.com consul.example.com
  <ip_3> vault-0.vault.example.com vault-1.vault.example.com vault.example.com consul.example.com
  ```
- If the Terraform apply step hangs on provisioning `null_resource.consul_acl_bootstrap`, check that Consul is responding at `consul.<your_domain>:8501` (a quick way to do this is shown after this list).
- If you receive an error similar to `Failed to create new policy: Unexpected response code: 500 (<specific error message>)`, the situation can be recovered by locating the `null_resource.consul_acl_bootstrap` resource and commenting out all lines of the `command` except those which start with `consul acl policy` or `consul acl token`. Terraform should then be re-run.
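As a quick way to perform that Consul check, the status API can be queried through the load-balancer. This is a minimal sketch, assuming it's run from the `ansible` folder so the client certificate and CA files are to hand:

```sh
# Ask Consul (via the load-balancer on port 8501) who the current leader is.
# A non-empty address in the response means the cluster is up and reachable.
curl --cacert consul-ca.crt --cert consul.crt --key consul.key \
  https://consul.<your_domain>:8501/v1/status/leader
```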
- Ensure you have a default VPC (named 'default') with a subnet (named 'default') that has outbound internet access. If you don't have one or don't want one, adjust the Packer templates and Terraform to look for a different pre-configured VPC.
- Adjust the Packer and Terraform variables.
  - Check the `gcp/packer/example.vars` file and supply credentials and a project ID.
  - Check the `gcp/terraform/example.tfvars` file, supply credentials, and adjust the hostnames and trusted external IPs to fit your setup. It's recommended to add the outbound IP of your machine to this list.
- Build the images with Packer. `cd` into the `gcp/packer` folder and run:
  ```
  packer build --var-file example.vars consul.json
  packer build --var-file example.vars vault.json
  ```
- Run the Terraform. `cd` into the `gcp/terraform` folder, run `terraform init ; terraform apply --var-file example.tfvars` and, if the plan looks good, approve it.
- While Terraform is running, set up the domain configured in `gcp/terraform/example.tfvars` to point at the Load Balancer frontends. This can be done with A records in your DNS zone, or by editing the hosts file:
  ```
  <consul_ip> consul.example.com
  <vault_ip> vault-0.vault.example.com vault-1.vault.example.com vault.example.com
  ```
- If the Terraform apply step hangs on provisioning `null_resource.consul_acl_bootstrap`, check that Consul is responding at `consul.<your_domain>:8501`.
- If you receive an error similar to `Failed to create new policy: Unexpected response code: 500 (<specific error message>)`, the situation can be recovered by locating the `null_resource.consul_acl_bootstrap` resource and commenting out all lines of the `command` except those which start with `consul acl policy` or `consul acl token`. Terraform should then be re-run.
- Ensure you have a virtual network set up that has outbound internet access, with a Network Security Group named 'default' attached to the relevant subnet.
  - Create a new resource group called 'default'.
  - Create a new virtual network called 'default' with whatever IP space fits your needs.
  - Create a new network security group called 'default' and associate it with the subnet in the virtual network.
- Adjust the Packer and Terraform variables.
  - Check the `azure/packer/example.vars` file and supply a subscription ID, a resource group name, and a region.
  - Check the `azure/terraform/example.tfvars` file and adjust the hostnames and trusted external IPs to fit your setup.
- Log in to the Azure CLI if you haven't already, and select the appropriate subscription with `az account set -s <subscription name or ID>`.
- Build the images with Packer. `cd` into the `azure/packer` folder and run:
  ```
  packer build --var-file example.vars consul.json
  packer build --var-file example.vars vault.json
  ```
  - If you run into issues authenticating with AzureAD, service principal authentication can be used instead. See the Packer docs.
- Run the Terraform. `cd` into the `azure/terraform` folder, run `terraform init ; terraform apply --var-file example.tfvars` and, if the plan looks good, approve it.
- While Terraform is running, set up the domains configured in `azure/terraform/example.tfvars` to point at the Load Balancer public IPs. This can be done with A records in your DNS zone, or by editing the hosts file:
  ```
  <consul_public_ip> consul.example.com
  <vault_public_ip> vault-0.vault.example.com vault-1.vault.example.com vault.example.com
  ```
- You may receive an error relating to HealthProbes when creating the scale set for Consul; in this case, re-run Terraform.
- If you receive an error similar to `Failed to create new policy: Unexpected response code: 500 (<specific error message>)`, the situation can be recovered by locating the `null_resource.consul_acl_bootstrap` resource and commenting out all lines of the `command` except those which start with `consul acl policy` or `consul acl token`. Terraform should then be re-run.
- Prepare a client certificate for use with Consul. `cd` into the `ansible` folder, run `openssl pkcs12 -export -in consul.crt -inkey consul.key -out consul.p12` and enter a password when prompted.
- Import this certificate into your browser of choice.
- Try to access Consul by browsing to `https://consul.<your_domain>:8501/ui`, selecting the certificate when prompted.
- Click on the `ACL` navbar item.
- Find the master token created during bootstrap and supply it to the UI; it should be in a file at `<cloud_provider>/terraform/master-token`.
- Try to access the HAProxy stats page for Vault by visiting `http://vault.<your_domain>/haproxy-stats`, or `http://<stats_ip>/haproxy-stats` if running on GCP.
- Initialise Vault and unseal it if you wish to experiment further.
  - `cd` into the `ansible` folder and set up some useful environment variables:
    ```
    export VAULT_ADDR=https://vault-0.vault.<your_domain>
    export VAULT_CACERT="$(pwd)/vault-ca.crt"
    ```
  - Run `vault operator init`.
  - Copy the unseal keys and root token from the output and paste them into a text editor.
  - Unseal the specific Vault node by running `vault operator unseal` and supplying an unseal key when prompted. Repeat this process until the node is unsealed (see also the status check after this list).
  - Once enough keys have been entered (3 by default), refresh the HAProxy stats page and look for the server that was just unsealed (vault-0) - it should be green at the bottom.
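To confirm the node you just unsealed is healthy, you can also check its seal status from the same shell, assuming `VAULT_ADDR` and `VAULT_CACERT` are still exported as above:

```sh
# 'Sealed: false' in the output indicates the node is unsealed and serving requests.
vault status
```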
The Terraform config in this repo uses the local filesystem for state storage instead of remote state. It is highly recommended to use a remote storage mechanism for Terraform's state.
Additionally, there are no version pins for any of the providers and it's recommended that you set some.
It's assumed that a network and subnet are available in which to set up the cluster; please adjust the automation accordingly.
Most variables are already set up with sensible values, but secrets or sensitive variables should be set per installation, along with any other installation-specific variables.
The `example.yml` group variables are not stored securely, for the purpose of enabling easy experimentation with this setup.
For a proper deployment, these secrets should of course be appropriately protected, using something such as ansible-vault, or by not committing them at all.
NOTE: Consul tokens can be configured through variables, but special remarks about them are made further on.
consul_user_password_hash - The password hash to set for the Consul system user
consul_gossip_encryption_key - The encryption key used to secure Gossip traffic between Consul nodes, generated with `consul keygen`
consul_template_user_password_hash - The password hash to set for the consul-template system user
vault_lb_hostname - The external hostname used to access the load-balanced Vault endpoint.
vault_user_password_hash - The password hash to set for the Vault system user
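As one way of producing values for these variables, the password hashes can be generated with `openssl passwd` and the gossip key with `consul keygen`; this is just one possible approach, assuming standard SHA-512 crypt hashes are acceptable for the system users:

```sh
# Generate a SHA-512 crypt hash for a system user password (prompts for the password).
openssl passwd -6

# Generate a gossip encryption key for consul_gossip_encryption_key.
consul keygen
```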
There is only a handful of variables needed by Terraform, each of which should be tweaked for your needs:
vault_hostname - The hostname which will be used to access Vault's load-balanced endpoint.
consul_hostname - The hostname which will be used to access Consul's load-balanced endpoint.
trusted_external_ips - The external IPs to permit when configuring external access to Vault and Consul.
consul_retry_join_config - This should not require adjustment unless the cloud auto-join tag or value is changed.
Some variables are provider-specific, such as GCP:
credentials - The path on disk of a credentials file for Terraform to use.
project - The ID of the project to provision resources in.
region - The region in which to provision resources.
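When filling in `trusted_external_ips`, one quick way to find your machine's outbound IP (the lookup service used here is just an example):

```sh
# Print this machine's public IP; append /32 where CIDR notation is required (e.g. AWS).
curl -s https://checkip.amazonaws.com
```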
Consul tokens are required for the Consul agent, for consul-template, and for Vault.
The `SecretID` values for each token are set in advance so that the machines can boot and automatically be able to perform their function without extra setup.
These are configured through variables in Ansible, which by default look for the tokens on the filesystem using the `lookup` plugin.
You should populate these tokens with your own values, which must be UUIDs, and can be supplied through files or by setting the Ansible variables explicitly.
NOTE: If you choose to use Ansible variables instead of files, the ACL bootstrap process in Terraform will need to be adjusted to remove the creation of Consul tokens.
The relevant ansible variables are as follows:
consul_agent_acl_token - The token for the Consul agent to use, expects a corresponding file in `ansible/roles/consul/files/tokens/agent`
consul_default_acl_token - The default token used by the Consul agent, expects a corresponding file in `ansible/roles/consul/files/tokens/agent`
consul_template_consul_token - The token used by consul-template to obtain node data about vault, expects a corresponding file in `ansible/roles/consul/files/tokens/haproxy`
vault_consul_acl_token - The token used by Vault to access Consul's KV store, expects a corresponding file in `ansible/roles/consul/files/tokens/vault`
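If you go with the file-based approach, a quick sanity check that each expected token file exists and contains a single lowercase UUID (run from the `ansible` folder):

```sh
# Print each token file referenced by the variables above.
for t in agent haproxy vault; do
  printf '%s: ' "$t"
  cat "roles/consul/files/tokens/$t"
done
```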
Certificates are used to secure traffic from Consul and Vault (TLS server certificates) as well as to Consul (TLS client certificates).
You should generate your own keys and certificates signed by a CA you trust.
Specific recommendations about TLS are in the Design section, and a script is provided in `tls-bootstrap` to get things started.
NOTE: Some values (particularly CNs and SANs) will need to be adjusted depending on hostnames in use.
Particular attention should be paid to the hostnames on the certificate to ensure that communication isn't blocked.
Consul expects a name of `consul` to be present within the Consul server and client certificates by default.
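To sanity-check a certificate before baking it into an image, you can inspect its names with openssl; for example, for the Consul certificate in the `ansible` folder:

```sh
# Print the Subject Alternative Names of the Consul certificate;
# the hostnames you intend to use (and the name 'consul') should appear here.
openssl x509 -in consul.crt -noout -text | grep -A1 "Subject Alternative Name"
```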
Ansible and Terraform expect the following files to be available at the root of the `ansible` folder:
- consul.crt - The certificate file (optionally containing the issuing CA's certificate) to use for Consul server and client authentication.
- consul-ca.crt - The certificate file of the CA that signs the certificate in `consul.crt`.
- consul.key - The private key file to use for Consul server and client authentication.
- vault.crt - The certificate file (optionally containing the issuing CA's certificate) to use for Vault server authentication.
- vault-ca.crt - The certificate file of the CA that signs the certificate in `vault.crt`.
- vault.key - The private key file to use for Vault server authentication.
Hostnames are only needed in a few places, and should be adjusted before provisioning. See:
- the `haproxy-consul-template` ansible role defaults
- the `vault_hostname` variable in Terraform
- the `consul_hostname` variable in Terraform
- the `CERTIFICATE_DOMAIN` variable in `tls-bootstrap/bootstrap.sh`
The automation does NOT create any DNS records, but does expect them to exist and therefore you should add the necessary automation to Terraform or arrange some other means of ensuring that the expected hostname resolves to an address on the load-balancer.
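A quick way to confirm the records are in place, assuming the hostnames configured earlier:

```sh
# Both names should resolve to addresses on the relevant load-balancer.
dig +short consul.<your_domain>
dig +short vault.<your_domain>
```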
There is no provision made to enable backups as the situation of each user is likely to be different. Since Consul is the backing store for Vault, an automated process that takes a snapshot of Consul and saves it somewhere would probably be useful.
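As a starting point for such a process, Consul's built-in snapshot command could be run on a schedule and the resulting file shipped to storage of your choice; a minimal sketch, assuming `CONSUL_HTTP_ADDR` and a suitably privileged `CONSUL_HTTP_TOKEN` are set:

```sh
# Take a snapshot of Consul's state (which backs Vault) and name it by date,
# then copy the file somewhere safe using whatever mechanism suits your setup.
consul snapshot save "consul-$(date +%F).snap"
```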
All external access is IP-controlled within security groups configured through Terraform. HTTPS communication to Consul is exposed via a load-balancer on port 8501 and traffic is sent to the autoscaling group. HTTPS communication to Vault is exposed via a load-balancer on port 443 and traffic is sent to HAProxy on the Vault nodes. Depending on the hostname supplied, traffic is routed either to any available Vault node or directly to a specific node.
This is done so that individual Vault nodes can be unsealed externally, and to enable initialisation of Vault.
Consul and Vault are exposed through a load-balancer and are expected to be available at `vault.<domain>` and `consul.<domain>`.
Individual Vault server nodes are available at `<instance name>.vault.<domain>`, where `<instance name>` is the name of the VM within the cloud provider.
By default this is something like `vault-0`.
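To illustrate the two access paths, the following requests exercise the load-balanced hostname and a specific node respectively; this is a sketch run from a trusted IP, assuming the `vault-ca.crt` file from the `ansible` folder and your own domain:

```sh
# Load-balanced Vault endpoint - any available node may answer.
curl --cacert vault-ca.crt https://vault.<domain>/v1/sys/health

# Direct-to-node access, e.g. when unsealing a specific instance.
curl --cacert vault-ca.crt https://vault-0.vault.<domain>/v1/sys/health
```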
Various systems need to be aware of the hostnames used for access, as well as requiring certificates with appropriate CNs and SANs. In particular these are:
- HAProxy (via the `haproxy-consul-template` ansible role)
- Terraform (via the `vault_hostname` and `consul_hostname` locals)
Private CAs are created to secure traffic to Consul and Vault, and the script in `tls-bootstrap` is designed to achieve this.
You can use whatever certificates you'd like, including Let's Encrypt, but be aware of the following:
- The certificates and keys are baked into the machine images.
- Ansible expects the certificate and key files to be available to it, so they should be placed in the `ansible` folder or within the `files` folder of the relevant role.
- If using a public CA, ensure that the `.crt` file contains the certificate of the issuing CA and any intermediates, and that the `-ca.crt` file contains the certificate of the root CA (see the example after this list).
- The CA used to secure outgoing communication (TLS server certs) from Consul must be the same as the one used to secure incoming communication (TLS client certs), so a private CA is recommended.
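For example, with a public CA the expected bundle layout might be assembled like this (the input file names are hypothetical and depend on what your CA issues):

```sh
# Leaf certificate plus any intermediates go into the .crt file Ansible expects...
cat leaf.crt intermediate.crt > consul.crt
# ...and the root CA certificate goes into the matching -ca.crt file.
cat root-ca.crt > consul-ca.crt
```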
Certificates are needed at various points in the provisioning process, chiefly by ansible and Terraform. Ansible bakes the certificate and key files into the machine image, and Terraform uses the Consul certificate files in the ACL bootstrapping process.
NOTE: The CNs and SANs used on certificates are critical and must match various expected names.
Of course for external access, the certificates should have `consul.<domain>`, `vault.<domain>`, and `*.vault.<domain>` names.
In addition, to enable Consul to communicate securely with itself, it expects a given name to be present in the certificate; by default this is `consul`.
If you wish to adjust this, be sure to update the Consul configuration to expect the newly assigned value.
An autoscaling group is created for Consul, but with no scaling rules as this is a very installation-specific concern. The Consul nodes are designed to be able to join a cluster with minimal fuss and use the cloud auto-join mechanism to do so. The agent goes through a bootstrap process on startup to configure the cloud auto-join settings as well as setting the agent ACL. The cloud auto-join settings are configured in Terraform.
The ACL system is bootstrapped using the bootstrap process and currently is achieved using a null resource in Terraform to call the relevant APIs from the machine running Terraform. The master token is captured and output to the filesystem for the operator to do with as they please. Some essential policies and tokens are also created at this point to enable Vault and consul-template to function. The bootstrap process will retry indefinitely until it succeeds, which can lead to an infinite provisioning loop if the bootstrap operation is successful but subsequent operations fail. In this situation, the bootstrap process should be reset, or the relevant lines should be commented allowing Terraform to re-run.
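For orientation, the bootstrap amounts to CLI calls of roughly this shape; the policy names, rules file, and exact flags here are illustrative rather than the precise ones used in the repo:

```sh
# One-time ACL bootstrap; the output includes the master (bootstrap) token.
consul acl bootstrap

# Using the master token, create a policy and a token bound to it
# (repeated for the agent, consul-template, and Vault use cases).
consul acl policy create -name example-policy -rules @example-policy.hcl
consul acl token create -policy-name example-policy -secret <pre-generated uuid>
```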
Having Consul tokens within machine images has been avoided as much as possible; however, a certain amount of it is necessary.
For the purposes of configuring the Consul agent with the necessary permissions to do node updates, a file is placed in `/etc/consul.d` for use in the agent bootstrap process.
Once the agent has been configured to use the token with the agent ACL API, the token file is deleted, as token persistence within Consul is enabled.
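The agent bootstrap step amounts to something like the following; the token file name is a placeholder, and the repo's bootstrap script may differ in detail:

```sh
# Register the pre-generated token as the agent token via the agent ACL API;
# with token persistence enabled, Consul keeps it across restarts.
consul acl set-agent-token agent "$(cat /etc/consul.d/<agent-token-file>)"

# The on-disk copy is no longer needed once Consul has persisted the token.
rm /etc/consul.d/<agent-token-file>
```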
HAProxy is installed on the Vault nodes to direct traffic as necessary and achieve the direct-to-node or load-balanced access described previously. To achieve this, there are two types of backend: a backend per node for direct-to-node access, containing only that specific node, and a single backend containing all nodes for load-balanced access. HAProxy is deliberately unaware of the content of any HTTP requests going through it (except stats), and uses the SNI presented by the client to decide where to send traffic. The HAProxy frontends can optionally accept the proxy protocol (defaults to on) from the fronting load-balancer. All backends within HAProxy (individual nodes and the load-balanced pool) have health checks enabled. The load-balanced backend uses an HTTPS check against Vault's health endpoint, and the individual node backends use HTTPS health checks against Vault's health endpoint that permit most error conditions. In addition, all backends send the proxy protocol to Vault.
Consul-template is used to query Consul for Vault node information and to populate HAProxy's configuration accordingly, for the individual node backends as well as the load-balanced backend.
Vault is set up to receive the proxy protocol and is configured such that any IP in the subnet is allowed to send the proxy protocol to Vault. This enables multiple Vault nodes to load-balance one another (with HAProxy) without needing to authorise specific IPs or dynamically configure Vault according to what nodes are available.
It's expected that the `file` audit method will be used, and so logrotate has been configured accordingly, expecting an audit log file to be placed in `/var/log/vault/` with an extension of `.log`.
It should be noted that auto-unsealing is not in use in this installation and the initialisation of Vault is left as an exercise for the operator.
It is hypothetically possible to create one or more PKI backends within Vault and have them serve as the CAs for securing Consul and Vault communication. This could give you such benefits as not needing to create machine images that contain certificates and keys, instead having the nodes generate keys and obtain signed certificates from Vault upon startup.
The reason this hasn't been done is that it makes the overall setup more complicated and requires more initial configuration during the setup of the system, as it creates a cyclical dependency on the cluster itself. You may of course pursue such a setup should you wish, just bear in mind the differences between automating 'the first' setup and 'the ongoing' setup. If the cluster needed to be rebuilt, it's likely that you would need to revert to storing certificates and keys within the image until the cluster can be brought up from scratch again. One way to achieve the self-certificating setup would be to use consul-template to request certificates from Vault, and restarting Consul or triggering a config reload when the files changed. It would be best to use the Vault agent as well to maintain a Vault token and have the agent make requests to Vault on behalf of consul-template. You would also need to change the explicit CA cert file in Consul's config, with a directory to permit the change in CA to take place as new agents are rolled out to the autoscaling group.
Incidentally the commentary mentions Consul as a target of automated certificates, but the approach for Vault would be very similar.
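If you did go down this route, the Vault side of it would look roughly like the standard PKI secrets engine workflow; this is a sketch with illustrative names, domains, and TTLs rather than a tested configuration:

```sh
# Enable and configure a PKI secrets engine to act as an internal CA.
vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki
vault write pki/root/generate/internal common_name="consul-ca" ttl=87600h

# A role consul-template (via the Vault agent) could use to request short-lived certificates.
vault write pki/roles/consul \
  allowed_domains="consul,vault.<domain>" allow_subdomains=true allow_bare_domains=true max_ttl=72h

# Example of issuing a certificate from that role.
vault write pki/issue/consul common_name="consul" ttl=24h
```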
It would also be possible to use a secret storage mechanism on a cloud provider to store the certificates and keys and have the machines pull them out of storage on startup. This hasn't been done in order to simplify the setup and to avoid introducing further dependencies outwith those already in use. Depending on your situation, you may wish to avoid trusting such a tool, or you may consider that acceptable.
If you wanted to pull certificates in on startup, it would be reasonably trivial to do and the userdata field could be used fairly effectively.
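As an illustration of that approach on AWS (other providers have equivalents), userdata could fetch the certificate material at boot; the secret names and destination paths here are hypothetical:

```sh
# Fetch the certificate and key from AWS Secrets Manager during instance startup;
# destination paths are illustrative and should match wherever Consul expects its TLS files.
aws secretsmanager get-secret-value --secret-id consul-tls-cert \
  --query SecretString --output text > /etc/consul.d/tls/consul.crt
aws secretsmanager get-secret-value --secret-id consul-tls-key \
  --query SecretString --output text > /etc/consul.d/tls/consul.key
```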
In this setup, Consul tokens are created with known secret values already provisioned within components such as consul-template and Vault. The tokens are stored in the machine image and removed if possible after startup (Consul only).
It would be possible to instead store these tokens within Vault or even a cloud provider's secrets storage facility and have the nodes retrieve them on startup. This hasn't been done for similar reasons to those discussed in the previous section - to avoid introducing unnecessary dependencies, to limit the reach of trust, and also to avoid complexity in the setup.
Once again, such a setup is fairly trivial to achieve, and the recommendation is to use userdata to trigger the behaviour.
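Similarly, if the tokens lived in Vault rather than in the image, startup could retrieve them with something like the following; the KV path and field name are hypothetical:

```sh
# Retrieve a pre-created Consul token from Vault's KV store at boot.
vault kv get -field=agent secret/consul/tokens > /etc/consul.d/agent-token
```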
It's possible to use the Consul provider for Terraform to create ACL policies and tokens within Consul.
In this setup, policies and tokens are instead created by calling the APIs via the Consul binary. The reason for this is, again, to avoid introducing complexity into the initial setup. When managing resources via Terraform, layered and explicitly-ordered dependencies within the same configuration don't always work well. The CLI-based approach allows for plenty of retries and a more robust experience than attempting to wire the Consul provider up to a cluster that doesn't yet exist or is still being provisioned.
Again, you could bootstrap the cluster and then go on to manage the ACL policies and tokens within Terraform, including importing the master and default tokens; this has been left as an exercise for the operator. It would also be possible to use Vault to create and distribute tokens for use with Consul, and much like the previous sections, this has been left out so as not to introduce complexity.