[BUG] Intermittent output helper destroying resources
exoaturner opened this issue · 1 comments
Checklist
- [ x ] Upgrade Terraspace: Are you using the latest version of Terraspace? This allows Terraspace to fix issues fast. There's an Upgrading Guide: https://terraspace.cloud/docs/misc/upgrading/
- [ x ] Reproducibility: Are you reporting a bug others will be able to reproduce and not asking a question. If you're unsure or want to ask a question, do so on https://community.boltops.com
- [ x ] Code sample: Have you put together a code sample to reproduce the issue and make it available? Code samples help speed up fixes dramatically. If it's an easily reproducible issue, then code samples are not needed. If you're unsure, please include a code sample.
My Environment
| Software | Version |
|---|---|
| Operating System | Ubuntu 22.04 |
| Terraform | 1.4.6 |
| Terraspace | 2.2.6 |
| Ruby | 3.1.2p20 |
Expected Behaviour
During `terraspace all (plan|up)`, the output helper for tfvars should consistently find values in the Terraform statefile.
Current Behavior
Under unknown conditions, the output helper doesn't always find existing values in the Terraform statefile and uses the mock value instead. This can result in resources being destroyed unintentionally, for example if a KMS key ID defaults to a mock.
Step-by-step reproduction instructions
Run `terraspace all up`.
Code Sample
I am unable to share exact reproduction steps because the problem is intermittent. I am also unable to share a code sample because of my company's security policies. However, I have done my own investigation and found some things that should help with implementing a fix; see below.
The temporary statefiles that Terraspace pulls from the Terraform state (in the /tmp/terraspace/remote_state/ directory) do contain the values. This suggests that Terraspace either isn't reading them correctly or isn't reading them at the correct time (a possible race condition).
The image below is a snapshot of one of the values in the statefile that wasn't populated in my most recent experience of the problem:
As you can see, the value for vpc_endpoint_ec2messages_id is in the statefile:
(I had to change the directory because of CICD)
So Terraspace/Terraform is fetching the state correctly, but Terraspace does not always appear to load it correctly. The behaviour is inconsistent within a single deployment: other stacks that use the same outputs resolve them fine, which is why I believe it's a race condition.
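The manual check described above (confirming that an output is present in a pulled statefile) could be scripted roughly as follows. This is only a sketch: the method name `state_has_output?` is made up for illustration, the statefile path has to be supplied by the caller, and it assumes the standard Terraform state JSON layout with a top-level `outputs` object.

```ruby
require "json"

# Hypothetical helper (not part of Terraspace): returns true if the
# Terraform state JSON at `path` has a top-level output called `name`.
def state_has_output?(path, name)
  state = JSON.parse(File.read(path))
  # Terraform state files keep outputs under the "outputs" key.
  state.fetch("outputs", {}).key?(name)
end

# Command-line use: ruby check_output.rb <statefile> <output_name>
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  path, name = ARGV
  if state_has_output?(path, name)
    puts "#{name} is present in #{path}"
  else
    warn "#{name} is MISSING from #{path}"
    exit 1
  end
end
```

Running this against the files in /tmp/terraspace/remote_state/ right after a failed run would show whether the value was on disk when the helper fell back to the mock.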
As you can see from the code snapshot below, the output helpers are set up correctly:
Solution Suggestion
I have two things to mention regarding this issue.
- Obviously the above is a problem and should be addressed somehow (I'm not sure exactly how).
- Terraspace should fail hard if mocks are detected in the compiled terraform code when deploying (after templating).
We have implemented a workaround that's worth sharing. This bug can still be annoying when large projects fail to deploy, but the workaround makes it less likely that infrastructure is destroyed by accident.
In the ./config/hooks/terraform.rb file we added some hooks that detect mocks and stop the deployment so we don't break anything.
# Terraspace calls out to the terraform command.
# You can execute commands before and after each command with CLI hooks.
#
# See: https://terraspace.cloud/docs/config/hooks/terraform
# WARNING: A hack to stop terraspace from deploying a stack if a mock value is found.
#
# Checks for the keyword 'mock' because it is commonly used for mocking names
# Checks for 0000000000 because it is commonly used for mocking ids
# Checks for 10.0. because it is commonly used for mocking ips and cidrs
# Note: the dots are written as \\. so that, after Ruby processes the
# double-quoted string, grep receives \. (a literal dot, not "any character").
before("plan",
  label: 'Warn about found mocks at plan stage',
  execute: "! grep -qsEi '(mock|0000000000|10\\.0\\.)' *.tfvars || echo \"\033[0;33mWARNING: Found mock values in $(basename $(pwd))\033[0m\" ",
  exit_on_fail: false,
)
before("apply",
  label: 'Fail if mock values found at deploy stage',
  execute: "! grep -qsEi '(mock|0000000000|10\\.0\\.)' *.tfvars || { echo \"\033[0;31mFAILURE: Found mock values in $(basename $(pwd))\033[0m\"; false; } ",
  exit_on_fail: true,
)
This is not a permanent fix: affected deployments will still fail, they just fail before anything is destroyed.
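For anyone who wants to see which lines trigger the check rather than just a pass/fail, the same patterns can be sketched in plain Ruby. The names `MOCK_PATTERN`, `mock_lines` and `find_mocks` are made up for this example and are not part of Terraspace; the patterns simply mirror the grep expression in the hooks above.

```ruby
# Same three patterns as the grep in the hooks: the word "mock",
# a run of ten zeros, or a "10.0." address prefix (case-insensitive).
MOCK_PATTERN = /mock|0000000000|10\.0\./i

# Returns the lines of `text` that look like they contain a mock value.
def mock_lines(text)
  text.each_line.map(&:chomp).select { |line| line.match?(MOCK_PATTERN) }
end

# Scans every *.tfvars file in `dir` and returns a hash of
# { file name => [suspicious lines] }, omitting files with no hits.
def find_mocks(dir = ".")
  Dir.glob(File.join(dir, "*.tfvars")).each_with_object({}) do |file, found|
    hits = mock_lines(File.read(file))
    found[File.basename(file)] = hits unless hits.empty?
  end
end
```

Calling `find_mocks` from the stack's build directory returns an empty hash when nothing suspicious is found. As with the grep version, a stack that legitimately uses a 10.0.x.x network would also be flagged, so the patterns may need adjusting per project.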