Sweep all resources not working

Question

aghassemlouei opened this issue 5 years ago · 6 comments

When using the entire all.yml services list with awsweeper 0.4.1 on macOS 10.15.3 against the AWS GovCloud regions us-gov-west-1 and us-gov-east-1, where resource counts exceed 800 per service, awsweeper hangs. To get a run to complete, the config file often has to be cut down to a subset of services, or to one service at a time.

I had to break EBS, EIP, and security groups out into individual executions. Also, it appears that VPC peering and public IP associations make it difficult to delete VPCs.

Answer 1 · 2020-02-23T21:23:22.000Z

Hi @aghassemlouei again. Thanks for reporting the issue. I made some bigger changes and fixes, committed to master in the last week, but haven't released them yet. Can you check whether your problems still occur on master (or with v0.5.0, which I will release by tomorrow)?

Answer 2 · 2020-02-25T06:36:31.000Z

Evening @jckuester,

Just ran the following steps and ran into similar issues but with incremental improvements:

curl -LO https://github.com/cloudetc/awsweeper/releases/download/v0.5.0/terradozer-0.5.0-darwin-amd64.tar.gz
tar -xzf terradozer-0.5.0-darwin-amd64.tar.gz
chmod +x terradozer-0.5.0-darwin-amd64/terradozer
cat > custom.yml << EOF
aws_ami:
aws_autoscaling_group:
aws_cloudformation_stack:
aws_ebs_snapshot:
aws_ebs_volume:
aws_efs_file_system:
aws_eip:
aws_elb:
aws_instance:
aws_internet_gateway:
aws_key_pair:
aws_kms_alias:
aws_kms_key:
aws_launch_configuration:
aws_nat_gateway:
aws_network_acl:
aws_network_interface:
aws_route53_zone:
aws_db_instance:
aws_route_table:
aws_s3_bucket:
aws_security_group:
aws_subnet:
aws_vpc:
aws_vpc_endpoint:
EOF
./terradozer-0.5.0-darwin-amd64/terradozer --region us-gov-west-1 --profile canary --dry-run custom.yml

When executed all at once, services wouldn't fully enumerate their resources; however, when broken out into smaller chunks, e.g., S3 buckets and RDS, things did work.

At least the S3 executions seem to be effective now, so I closed out #71. When I let the execution run over the weekend, the VPC peering connections apparently threw awsweeper/Terraform for a loop with dependencies that couldn't be broken, so that may also be something to take into consideration if folks just import all.yml and execute it.

Thanks again for the quick release; hopefully this data is useful and not bothersome!

Answer 3 · 2020-02-25T09:00:17.000Z

Thanks for your feedback. I haven't tested awsweeper at scale yet and your insights are very interesting and helpful - I'll do my best to improve your experience with the tool. Let's go into more detail about what you experienced:

"wouldn't fully enumerate their resources": does this happen during the listing/dry-run stage, before deletion starts, or are all resources fully listed and only incompletely enumerated during the deletion stage?
Note that I haven't implemented pagination with the AWS API yet, which might also cause only a limited number of resources (per resource type) to be listed rather than all of them. Breaking the config into smaller chunks shouldn't really help with that issue, though; running awsweeper several times should.
Regarding the VPC peering connections throwing awsweeper/Terraform for a loop: what did the output look like here? Did it say 'will retry to delete resource'? It might be that the max_retries parameter of Terraform is set too high (default 25), and therefore a failed deletion is retried too often and hangs for a very long time (hashicorp/terraform-provider-aws#1209, https://www.terraform.io/docs/providers/aws/index.html#max_retries). Someone added the max_retries parameter to awsweeper, but it is currently disabled. I will fix that.
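
To illustrate the pagination point above, here is a minimal Go sketch of following a "next token" until a listing is exhausted. It is not awsweeper's code and does not use the AWS SDK: fetchPage, listAll, and fakeBackend are hypothetical names, and the fake backend simply pretends that a service returns at most 100 IDs per call.

```go
package main

import (
	"fmt"
	"strconv"
)

// fetchPage is a hypothetical stand-in for one paginated "list" call.
// It returns one page of resource IDs and the token for the next page
// ("" once there are no more pages).
type fetchPage func(token string) (ids []string, next string, err error)

// listAll keeps requesting pages until the backend stops returning a token.
func listAll(fetch fetchPage) ([]string, error) {
	var all []string
	token := ""
	for {
		ids, next, err := fetch(token)
		if err != nil {
			return nil, err
		}
		all = append(all, ids...)
		if next == "" {
			return all, nil
		}
		token = next
	}
}

// fakeBackend simulates a service holding `total` resources that returns
// at most pageSize IDs per call (pageSize is assumed to be > 0).
func fakeBackend(total, pageSize int) fetchPage {
	return func(token string) ([]string, string, error) {
		start := 0
		if token != "" {
			n, err := strconv.Atoi(token)
			if err != nil || n < 0 || n > total {
				return nil, "", fmt.Errorf("invalid token %q", token)
			}
			start = n
		}
		end := start + pageSize
		if end > total {
			end = total
		}
		ids := make([]string, 0, end-start)
		for i := start; i < end; i++ {
			ids = append(ids, fmt.Sprintf("vol-%04d", i))
		}
		next := ""
		if end < total {
			next = strconv.Itoa(end)
		}
		return ids, next, nil
	}
}

func main() {
	backend := fakeBackend(850, 100)
	firstPage, _, _ := backend("") // what a single, unpaginated call would see
	all, err := listAll(backend)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(firstPage), len(all)) // prints "100 850"
}
```

A single call sees only the first page, which would match the symptom of a resource type being listed incompletely regardless of how the config file is split.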

Answer 4 · 2020-02-25T09:27:47.000Z

Hmm, I just looked into the code for how Terraform deletes a VPC (see below). In the case you described, it is a DependencyViolation (because a VPC peering connection is still attached), so Terraform will retry deleting for 5 minutes. This is not really what we want, and unfortunately the max_retries parameter mentioned above will not help here.

	err := resource.Retry(5*time.Minute, func() *resource.RetryError {
		_, err := conn.DeleteVpc(deleteVpcOpts)
		if err == nil {
			return nil
		}

		if isAWSErr(err, "InvalidVpcID.NotFound", "") {
			return nil
		}
		if isAWSErr(err, "DependencyViolation", "") {
			return resource.RetryableError(err)
		}
		return resource.NonRetryableError(fmt.Errorf("Error deleting VPC: %s", err))
	})
	if isResourceTimeoutError(err) {
		_, err = conn.DeleteVpc(deleteVpcOpts)
		if isAWSErr(err, "InvalidVpcID.NotFound", "") {
			return nil
		}
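
The retry behaviour in the excerpt above can be approximated with a small, self-contained Go sketch. This is a simplified stand-in, not the Terraform helper itself: retryUntil and errDependencyViolation are hypothetical names, and the timeout is shortened from 5 minutes to 50 ms so the example finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDependencyViolation is a hypothetical sentinel for the retryable case.
var errDependencyViolation = errors.New("DependencyViolation")

// retryUntil calls op until it succeeds, fails with a non-retryable error,
// or the next attempt would start after the timeout has elapsed.
func retryUntil(timeout, interval time.Duration, op func() error) (attempts int, err error) {
	deadline := time.Now().Add(timeout)
	for {
		attempts++
		err = op()
		// Success and non-retryable errors end the loop immediately.
		if err == nil || !errors.Is(err, errDependencyViolation) {
			return attempts, err
		}
		// Give up if sleeping again would take us past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return attempts, fmt.Errorf("timed out after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// A delete that always hits a dependency is retried until the deadline.
	attempts, err := retryUntil(50*time.Millisecond, 10*time.Millisecond, func() error {
		return errDependencyViolation
	})
	fmt.Println(attempts > 1, err != nil) // prints "true true"
}
```

Because the loop is bounded by elapsed time rather than by an attempt count, a setting that only limits the number of retries would not shorten it, which is in line with the observation above that max_retries does not help in this case.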

Answer 5 · 2020-03-02T21:31:10.000Z

Hi @aghassemlouei again. I thought about the problem again and came up with a solution. Let me know what you think.

awsweeper can now be run with a timeout for the delete operation, i.e., awsweeper --timeout 1s config.yml.

This way, if a VPC or any other resource still has a dependency, the delete times out after, for example, 1s (the default is 20s). Here is what the output looks like:

   • SHOWING RESOURCES THAT WOULD BE DELETED (DRY RUN)

	---
	Type: aws_vpc
	Found: 1

		Id:		vpc-1234
		Tags:		[Name: foo] 

	---

   • TOTAL NUMBER OF RESOURCES THAT WOULD BE DELETED: 1
      • Are you sure you want to delete these resources (cannot be undone)? Only YES will be accepted.
        Enter a value: YES
   • STARTING TO DELETE RESOURCES
      • will retry to delete resource                      id=vpc-1234 type=aws_vpc
   • FAILED TO DELETE THE FOLLOWING RESOURCES (RETRIES EXCEEDED): 1
      • aws_vpc                                            error=destroy timed out (1s) id=vpc-1234
   • TOTAL NUMBER OF DELETED RESOURCES: 0
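
One way a per-resource timeout like this could be built is to run the delete in its own goroutine and stop waiting once the timeout fires. The Go sketch below is only an assumption about the general technique, not awsweeper's actual implementation; destroyWithTimeout and slowDestroy are hypothetical names, and the error text merely mirrors the message shown in the output above.

```go
package main

import (
	"fmt"
	"time"
)

// destroyWithTimeout runs destroy in a goroutine and stops waiting for it
// once timeout has elapsed. The destroy call itself is not cancelled here;
// the caller simply moves on and reports the resource as failed.
func destroyWithTimeout(timeout time.Duration, destroy func() error) error {
	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- destroy() }()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("destroy timed out (%s)", timeout)
	}
}

// slowDestroy stands in for a delete that keeps retrying internally,
// e.g. because a dependency still exists.
func slowDestroy(d time.Duration) func() error {
	return func() error {
		time.Sleep(d)
		return nil
	}
}

func main() {
	err := destroyWithTimeout(100*time.Millisecond, slowDestroy(500*time.Millisecond))
	fmt.Println(err) // prints "destroy timed out (100ms)"
}
```

The trade-off in this sketch is that the abandoned delete keeps running in the background until it returns on its own; only the wait is bounded.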

Answer 6 · 2020-05-08T00:33:02.000Z

This worked significantly better, if for nothing else than the feedback presented to the end user. Syntax provided for posterity:

curl -LO https://github.com/cloudetc/awsweeper/releases/download/v0.7.0/awsweeper_0.7.0_darwin_amd64.tar.gz
tar -xzf awsweeper_0.7.0_darwin_amd64.tar.gz 
chmod +x awsweeper_0.7.0_darwin_amd64/awsweeper 
cat > custom.yml << EOF
aws_ami:
aws_autoscaling_group:
aws_cloudformation_stack:
aws_ecs_cluster:
aws_ebs_snapshot:
aws_ebs_volume:
aws_efs_file_system:
aws_eip:
aws_elb:
aws_iam_instance_profile:
aws_iam_role:
aws_instance:
aws_internet_gateway:
aws_key_pair:
aws_kms_alias:
aws_kms_key:
aws_lambda_function:
aws_launch_configuration:
aws_nat_gateway:
aws_network_acl:
aws_network_interface:
aws_db_instance:
aws_route53_zone:
aws_route_table:
aws_s3_bucket:
aws_security_group:
aws_subnet:
aws_vpc:
aws_vpc_endpoint:
EOF
./awsweeper_0.7.0_darwin_amd64/awsweeper --region us-gov-west-1 --profile core --timeout 1s custom.yml

The failure conditions were far clearer, with a faster turnaround. The only cosmetic bit of feedback concerns the AWS-managed IAM roles and the KMS keys. Terraform complains about them, but it's definitely a non-issue:

error deleting IAM Role (AWSServiceRoleForSupport) policy attachments: Error deleting IAM Role AWSServiceRoleForSupport: UnmodifiableEntity: Cannot perform the operation on the protected role 'AWSServiceRoleForSupport' - this role is only modifiable by AWS

AccessDeniedException: User: arn:aws-us-gov:iam::123456789:user/aghassemlouei is not authorized to perform: kms:ScheduleKeyDeletion on resource: arn:aws-us-gov:kms:us-gov-west-1:123456789:key/1234567-1234-1234-1234-1234567

Closing this out as the major issues have been addressed; thanks for all your hard work!