aws-quickstart/quickstart-hashicorp-vault

Create into existing stack fails at VaultServerAutoScalingGroup stage

pveronneau opened this issue · 6 comments

I appear to be having issues with the VaultServerAutoScalingGroup stage.

(screenshot of the stack creation error at VaultServerAutoScalingGroup)

I also get a similar error when trying to deploy into a new VPC but that is already covered under #57.

Checking the stack after the error shows that the hosts were built, and they appear healthy in the Auto Scaling group. Any ideas on what is causing this step to fail, or any troubleshooting steps I should attempt?

Parameters used:
ACMSSLCertificateArn |
AccessCIDR | 0.0.0.0/0 | -
BastionSecurityGroupID | sg-09346cbab09740be0 | -
DomainName | vault.fqdn.com (redacted) | -
HostedZoneID | REDACTED | -
KeyPairName | vault-cluster | -
LoadBalancerType | External | -
PrivateSubnet1ID | subnet-ff21da88 | -
PrivateSubnet2ID | subnet-6956f30c | -
PrivateSubnet3ID | subnet-e20f14a4 | -
PublicSubnet1ID | subnet-fc21da8b | -
PublicSubnet2ID | subnet-7956f31c | -
PublicSubnet3ID | subnet-e80f14ae | -
QSS3BucketName | aws-quickstart | -
QSS3BucketRegion | us-east-1 | -
QSS3KeyPrefix | quickstart-hashicorp-vault/ | -
VPCCIDR | 10.50.0.0/16 | -
VPCID | vpc-51e62f34 | -
VaultAMIOS | CIS-Ubuntu-1604-HVM | -
VaultClientNodes | 1 | -
VaultClientRoleName | hashicorp-vault-client-role-iam | -
VaultInstanceType | m5.large | -
VaultKubernetesCertificate | - | -
VaultKubernetesEnable | false | -
VaultKubernetesHostURL | https://192.168.99.100:8443 | -
VaultKubernetesJWT | - | -
VaultKubernetesNameSpace | default | -
VaultKubernetesPolicies | default | -
VaultKubernetesRoleName | kube-auth-role | -
VaultKubernetesServiceAccount | vault-auth | -
VaultNumberOfKeys | 5 | -
VaultNumberOfKeysForUnseal | 3 | -
VaultServerNodes | 3 | -
VaultVersion | 1.4.0 | -

I can see no signal received back from the Auto Scaling group hosts, which suggests a communications issue from the Vault server hosts. If communications were working, one would expect to see FAILURE signals; the error indicates that no signals were received at all.

What is the network configuration of the VPC into which the hosts are being deployed?

The hosts need outbound internet access to update/patch and install utilities during bootstrapping.

If you are locking this down, you would need to allow at least the following access:

  • Apt package repositories.
  • HashiCorp's release servers, for the Vault installation packages.
  • Access to at least the following AWS services:
    • CloudFormation (for signalling)
    • Secrets Manager (for unseal/root token storage)
    • EC2 Auto Scaling (for discovering peer instances during Vault server configuration)
    • SSM Parameter Store (for coordinating cluster leader election/bootstrapping order)
    • Lambda (for coordinating cluster leader election/bootstrapping order)
    • KMS (for KMS auto-unseal of Vault and encryption of the root token/unseal keys)
    • IAM/STS (for getting access to the aforementioned services)
      (See https://docs.aws.amazon.com/general/latest/gr/rande.html for service endpoints.)
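As a rough connectivity checklist, the regional endpoints for those services can be enumerated from one of the hosts. This is only a sketch: the `<service>.<region>.amazonaws.com` hostname pattern is the common form, but confirm against the endpoints reference linked above for your region.

```shell
# Print the regional AWS endpoints the Vault hosts need to reach during
# bootstrapping (service list taken from the comment above).
required_endpoints() {
  region="$1"
  for svc in cloudformation secretsmanager autoscaling ssm lambda kms sts; do
    printf 'https://%s.%s.amazonaws.com\n' "$svc" "$region"
  done
}

required_endpoints us-west-2
```

Piping each printed URL through something like `curl --max-time 5` from an instance gives a quick yes/no on outbound reachability.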

Once the above is addressed, it could also be an issue with the bootstrapping of the instances. Investigating this would require deploying the stack with rollback disabled and then looking at the system logs on the servers.

This can be done either through the AWS web console/AWS CLI or by logging onto the instances:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-console.html
https://docs.aws.amazon.com/cli/latest/reference/ec2/get-console-output.html

If you have tried the above and things are still not working, please reach out again and I will happily assist further.

I found an ACL error that I have corrected; however, the build still fails at the same step. It appears that it could not fetch the functions.sh and bootstrap_server.sh scripts from the Quick Start bucket.

~snip~
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Building wheels for collected packages: awscli, PyYAML
  Building wheel for awscli (setup.py): started
  Building wheel for awscli (setup.py): finished with status 'done'
  Created wheel for awscli: filename=awscli-1.18.82-py2.py3-none-any.whl size=3171812 sha256=3872a2ce3b7f5a99e0234c9f90ac767a6199e5b66a43942f5cb0162f9a02c0a1
  Stored in directory: /root/.cache/pip/wheels/d0/62/14/59f66f3e69cbbcc140d809d76c08b422ae69046356bdccbea3
  Building wheel for PyYAML (setup.py): started
  Building wheel for PyYAML (setup.py): finished with status 'done'
  Created wheel for PyYAML: filename=PyYAML-5.3.1-cp27-cp27mu-linux_x86_64.whl size=45644 sha256=328c7bec3358fbb466a9f6234dd2fe96147c8e417965382fec6b045e5f10a224
  Stored in directory: /root/.cache/pip/wheels/d1/d5/a0/3c27cdc8b0209c5fc1385afeee936cf8a71e13d885388b4be2
Successfully built awscli PyYAML
Installing collected packages: six, python-dateutil, docutils, urllib3, jmespath, botocore, pyasn1, rsa, futures, s3transfer, PyYAML, colorama, awscli
Successfully installed PyYAML-5.3.1 awscli-1.18.82 botocore-1.17.5 colorama-0.4.3 docutils-0.15.2 futures-3.3.0 jmespath-0.10.0 pyasn1-0.4.8 python-dateutil-2.8.1 rsa-3.4.2 s3transfer-0.3.3 six-1.15.0 urllib3-1.25.9
+ mkdir -p /opt/vault/policies/ /opt/vault/scripts/ /etc/vault.d/
+ aws s3 cp s3://aws-quickstart/quickstart-hashicorp-vault/scripts/functions.sh .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
+ aws s3 cp s3://aws-quickstart/quickstart-hashicorp-vault/scripts/bootstrap_server.sh .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
+ chmod +x bootstrap_server.sh
chmod: cannot access 'bootstrap_server.sh': No such file or directory
++ which bash
+ /bin/bash -e ./bootstrap_server.sh
/bin/bash: ./bootstrap_server.sh: No such file or directory
+ /usr/local/bin/cfn-signal -e 1 --stack HashiCorp-Vault --region us-west-2 --resource VaultServerAutoScalingGroup
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~16.04.1 running 'modules:final' at Wed, 17 Jun 2020 20:09:47 +0000. Up 15.01 seconds.
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~16.04.1 finished at Wed, 17 Jun 2020 20:10:33 +0000. Datasource DataSourceEc2Local.  Up 61.29 seconds

This appears to be the reason for the failure.

The same error also occurs on the stand-alone VPC version:

Installing collected packages: six, python-dateutil, docutils, urllib3, jmespath, botocore, pyasn1, rsa, futures, s3transfer, PyYAML, colorama, awscli
Successfully installed PyYAML-5.3.1 awscli-1.18.82 botocore-1.17.5 colorama-0.4.3 docutils-0.15.2 futures-3.3.0 jmespath-0.10.0 pyasn1-0.4.8 python-dateutil-2.8.1 rsa-3.4.2 s3transfer-0.3.3 six-1.15.0 urllib3-1.25.9
+ mkdir -p /opt/vault/policies/ /opt/vault/scripts/ /etc/vault.d/
+ aws s3 cp s3://aws-quickstart/quickstart-hashicorp-vault/scripts/functions.sh .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
+ aws s3 cp s3://aws-quickstart/quickstart-hashicorp-vault/scripts/bootstrap_server.sh .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
+ chmod +x bootstrap_server.sh
chmod: cannot access 'bootstrap_server.sh': No such file or directory
++ which bash
+ /bin/bash -e ./bootstrap_server.sh
/bin/bash: ./bootstrap_server.sh: No such file or directory
+ /usr/local/bin/cfn-signal -e 1 --stack HashiCorp-Vault-HashiCorpVaultStack-RIZFC50XJU0Q --region us-west-2 --resource VaultServerAutoScalingGroup
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~16.04.1 running 'modules:final' at Wed, 17 Jun 2020 20:46:17 +0000. Up 16.31 seconds.
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~16.04.1 finished at Wed, 17 Jun 2020 20:47:06 +0000. Datasource DataSourceEc2Local.  Up 64.53 seconds

Modifying the command to: aws --no-sign-request s3 cp s3://aws-quickstart/quickstart-hashicorp-vault/scripts/functions.sh .
results in a success. Since --no-sign-request sends the request unauthenticated, it is evaluated against the bucket's public access settings rather than the instance role's policy, which points at an IAM permissions problem rather than the bucket itself.

I've isolated the cause of the S3 transfer error.

During stack creation, the IAM role VaultInstanceRole-sg* gets created with a "root" policy. That policy contains the following permission:

        {
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::aws-quickstart-us-west-2/quickstart-hashicorp-vault/*",
            "Effect": "Allow"
        }

Here, aws-quickstart-us-west-2 reflects whichever region you deploy the stack into; the region suffix is substituted at deploy time. The bootstrap, however, downloads from the global bucket name (aws-quickstart), which that ARN does not cover.

This results in the following behavior:

ubuntu@ip-10-0-81-158:~$ aws s3 cp s3://aws-quickstart/quickstart-hashicorp-vault/scripts/functions.sh .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

You can however call the regional bucket name, and that results in a success:

ubuntu@ip-10-0-81-158:~$ aws s3 cp s3://aws-quickstart-us-west-2/quickstart-hashicorp-vault/scripts/bootstrap_server.sh .
download: s3://aws-quickstart-us-west-2/quickstart-hashicorp-vault/scripts/bootstrap_server.sh to ./bootstrap_server.sh
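One way to make the bootstrap match the policy as written would be to derive the regional bucket name in the cloud-init script. This is only a sketch; "us-west-2" is a placeholder, and in practice the region could come from instance metadata or a template parameter.

```shell
# Derive the regional Quick Start bucket name so downloads line up with
# the ARN the instance policy already grants.
region="us-west-2"   # placeholder; cloud-init could read this from instance metadata
bucket="aws-quickstart-${region}"
src="s3://${bucket}/quickstart-hashicorp-vault/scripts"
echo "${src}/functions.sh"
echo "${src}/bootstrap_server.sh"
# Then: aws s3 cp "${src}/functions.sh" .   (and likewise for bootstrap_server.sh)
```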

If you modify the IAM root policy to allow access to the global bucket name:

        {
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::aws-quickstart/quickstart-hashicorp-vault/*",
            "Effect": "Allow"
        }

It then correctly downloads the scripts as intended:

ubuntu@ip-10-0-94-151:~$ aws s3 cp s3://aws-quickstart/quickstart-hashicorp-vault/scripts/bootstrap_server.sh .
download: s3://aws-quickstart/quickstart-hashicorp-vault/scripts/bootstrap_server.sh to ./bootstrap_server.sh
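Alternatively, a policy statement that lists both bucket forms would cover either URL. This is a sketch only; the actual template substitutes the region and key prefix from stack parameters.

```json
        {
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::aws-quickstart/quickstart-hashicorp-vault/*",
                "arn:aws:s3:::aws-quickstart-us-west-2/quickstart-hashicorp-vault/*"
            ],
            "Effect": "Allow"
        }
```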

@gargana Would it be possible to get the cloud-init updated to include the region or adjust the root policy template so this stack can successfully deploy?

Thanks for the work on this. I will adjust the root policy to include the regional bucket piece.

At least the error changed?
(screenshot of the new error)

@gargana can you review that last commit?


Magnificent, thanks for all your efforts @gargana

I suspect this effort also resolves issue #57.