gruntwork-io/cloud-nuke

Gruntwork test ACM certificate getting removed in `us-west-1` and `eu-west-2`

Closed this issue · 25 comments

The wildcard ACM certificate used for testing in phxdevops is being removed by the scheduled cloud-nuke job, but only in the us-west-1 and eu-west-2 regions. We need this certificate to be present in these two regions for tests.

I haven't had a chance to dig in to see whether we are using a config exclusion that isn't working, or whether there is some other reason for the removal. It appears to affect only these two regions; the certificate does not get deleted in other regions.

Temporarily disabled tests for these regions in terraform-aws-load-balancer (via this commit) to minimize test failures. Leaving a note here as a reminder to re-enable these regions once this issue is resolved.

I looked at the existing configuration and it doesn't seem to have any filtering for the ACM resource type - https://github.com/gruntwork-io/cloud-nuke/blob/master/.circleci/nuke_config.yml.

Also, we don't pass ACM as an excluded resource type - https://github.com/gruntwork-io/cloud-nuke/blob/master/.circleci/config.yml#L45.

It seems the ACM certificate for Gruntwork testing has the domain name *.gruntwork.in, so we can create a filter for that.

image

It seems like we have an additional filter to exclude resources that are currently in use. However, I guess the certificates in those regions were not being used for some reason?

Hey @arsci, I created certificates in both us-west-1 and eu-west-2 regions. I'm not sure if creating certificates solves the test failures. I'll leave this ticket open until we verify whether tests pass again.

Thanks @hongil0316! Yes, just creating those certs should solve the test errors.

@gcagle3 you can re-enable tests for your branch in those two regions

These have been re-enabled. Thanks!

Nice. I'll close this issue then!

@hongil0316 would it be possible to add a filter for the domain name 'gruntwork.in' as well? It looks like we'll need ACM certs for both '*.gruntwork.in' and 'gruntwork.in' for all of the load balancer tests to complete.

Example:
image

We'll need this for the following regions (if the filter is region-specific; if not, no worries):

  • us-east-1
  • us-east-2
  • us-west-1
  • us-west-2
  • eu-west-1
  • eu-west-2

I believe this change should cover both scenarios:

https://github.com/gruntwork-io/cloud-nuke/pull/580/files

Ah, that should work, splendid. Thank you, I'll re-close this one!

@hongil0316 hm, some bad news. It looks like the certificates for just 'gruntwork.in' (not *.gruntwork.in) are still being deleted every night.

As a thought, should the 'include' in this code actually be exclude? If I'm reading this right, it looks like 'include' is used to identify resources to be nuked and 'exclude' is being used to protect resources from deletion. In this case, we want to protect both *.gruntwork.in and gruntwork.in certs.

ACM:
  include:
    names_regex:
      - "^gruntwork.in"
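
To make the include/exclude question concrete, here is a small Go sketch of how the pattern in the snippet above behaves as a plain regular expression. It assumes the pattern is matched against the certificate's domain name using Go's `regexp` package, which is an assumption about cloud-nuke's internals and not something confirmed in this thread. `matchesAny` is a hypothetical helper, and the second pattern is only an illustration of one that would cover both domains.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesAny reports whether name matches at least one pattern.
// Hypothetical helper for illustration; not cloud-nuke's actual code.
func matchesAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if regexp.MustCompile(p).MatchString(name) {
			return true
		}
	}
	return false
}

func main() {
	// Pattern from the config snippet: anchored at the start of the name.
	anchored := []string{"^gruntwork.in"}
	// Illustrative alternative: anchored at the end, with the dot escaped.
	suffix := []string{`gruntwork\.in$`}

	for _, domain := range []string{"gruntwork.in", "*.gruntwork.in"} {
		fmt.Printf("%-16s anchored=%v suffix=%v\n",
			domain, matchesAny(anchored, domain), matchesAny(suffix, domain))
	}
}
```

Under these assumptions the anchored pattern only matches the bare 'gruntwork.in' name, because the wildcard name starts with '*'. Whichever of 'include' or 'exclude' the rule sits under, it would apply to the bare domain only.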

@hongil0316 would there be anything else scheduled to clear these certs out? Monitoring things from yesterday, it looks like all of the 'gruntwork.in' certs were wiped out again last night.

Noting the following changes in the Phoenix account:

  • In us-east-1, us-east-2, us-west-2, and eu-west-1, the 'gruntwork.in' cert was deleted
  • In us-west-1 and eu-west-2, both the 'gruntwork.in' and '*.gruntwork.in' certs were deleted

Hey @gcagle3 @arsci, I just tested the cloud-nuke behaviour with the CircleCI nuke_config.yml and it seems to work as expected. This is the command I used:

aws-vault exec phxdevops -- go run main.go aws --resource-type acm --log-level debug --region us-east-1 --config .circleci/nuke_config.yml

And here are the debug lines:
image

image

Looking at the CloudTrail activity, the certs seem to be deleted by the circle-ci-test user

image

Hey @arsci @gcagle3, can you tell me how you are creating the ACM certificates? Maybe the way I create them is not the same as the way the certificates are created for the unit tests?

This is how I created mine:

image

This is exactly the same process I've been using for both gruntwork.in and *.gruntwork.in.

Hmm, the debug message doesn't seem to help much... I wonder whether the config file values are being applied properly.
The odd thing is that we see different behaviour when running in CircleCI vs. locally...

Would CircleCI be setting specific environment variables that would change the behavior?

Looking at the debug logs, it seems like the config is not properly parsed for whatever reason:

  DEBUG   shouldInclude result for ACM: arn:aws:acm:us-west-1:087285199408:certificate/1fc80fe9-5bcf-43d9-86c1-e5a6d3d6ffd9 w/ domain name: gruntwork.in, time: 2023-10-11 19:46:05.48 +0000 UTC, and config: {IncludeRule:{NamesRegExp:[] TimeAfter:<nil> TimeBefore:<nil> Tag:<nil>} ExcludeRule:{NamesRegExp:[] TimeAfter:2023-10-11 19:52:45.931855658 +0000 UTC m=-7198.437556645 TimeBefore:<nil> Tag:<nil>}}

You will notice that NamesRegExp is empty for both IncludeRule and ExcludeRule, while TimeAfter is set on ExcludeRule.

After internal discussion, we're wondering if the default tests (which run with no config) might be causing this. Looking at the test, it checks whether each cert is in use and tries to delete any cert that is not.

Comparing the above against the following we've observed:

  • In us-east-1, us-east-2, us-west-2, and eu-west-1, the 'gruntwork.in' cert was deleted
  • In us-west-1 and eu-west-2, both the 'gruntwork.in' and '*.gruntwork.in' certs were deleted

It is important to note that the '*.gruntwork.in' cert is 'in use' in the us-east-1, us-east-2, us-west-2 and eu-west-1 regions where it is not being deleted. It is not in use in us-west-1 / eu-west-2 where it is being deleted, which would make sense if the tests are actually deleting the certs. All certs being deleted are not in use.
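
The reasoning above can be checked with a small Go sketch. It assumes the only safeguard in a no-config run is the in-use check, and it encodes the in-use state described in this comment: the wildcard cert is in use in four regions, and the bare 'gruntwork.in' cert is not in use anywhere. The `cert` type and `deletable` function are hypothetical and are not part of cloud-nuke.

```go
package main

import "fmt"

// cert is a hypothetical, simplified view of an ACM certificate.
type cert struct {
	Region string
	Domain string
	InUse  bool
}

// deletable returns the certs a run with no config would remove,
// assuming the in-use check is the only thing protecting a cert.
func deletable(certs []cert) []cert {
	var out []cert
	for _, c := range certs {
		if !c.InUse {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Regions where the wildcard cert is reported as in use.
	wildcardInUse := map[string]bool{
		"us-east-1": true, "us-east-2": true, "us-west-2": true, "eu-west-1": true,
	}
	regions := []string{"us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-west-1", "eu-west-2"}

	var certs []cert
	for _, r := range regions {
		certs = append(certs,
			cert{r, "gruntwork.in", false},
			cert{r, "*.gruntwork.in", wildcardInUse[r]})
	}

	for _, c := range deletable(certs) {
		fmt.Printf("%s\t%s\n", c.Region, c.Domain)
	}
}
```

With these inputs the sketch selects the bare cert in all six regions and the wildcard cert only in us-west-1 and eu-west-2, which lines up with the deletions listed above.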

@hongil0316 do you think it might be worth adjusting the test case to address this?

Had further discussion internally here: https://gruntwork-io.slack.com/archives/C6V3DJAHJ/p1697159092430149
We initially mitigated the issue by excluding the ACM resource type - https://github.com/gruntwork-io/cloud-nuke/pulls?q=is%3Apr+is%3Aclosed.

After further troubleshooting, I realized that the bug occurred when both a config file and a time filter were present at the same time. Here is a fix for the bug - #607

As a quick follow-up, I can confirm that both the *.gruntwork.in and gruntwork.in certificates have not been deleted (and are still present) since this PR was merged. Great work on this @hongil0316!