Endless loop on terraform destroy with nodepool in state "FAILED_DESTROYING"

Question

salyh opened this issue 5 months ago · 6 comments

Description

I provisioned nodepools via Terraform. Once a nodepool is active, I want to destroy it via terraform destroy.
Terraform then runs for more than 30 minutes trying to destroy the nodepool. In the DCD UI, I see that the status is "FAILED_DESTROYING" (see screenshot).

When I cancel the Terraform run with Ctrl-C and rerun it, the nodepool is actually destroyed within a few seconds.

Expected behavior

Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run

Environment

Terraform version:

OpenTofu v1.7.2

Provider version:

v6.4.17

OS:

References

#579

Answer 1 · 2024-06-27T13:13:02.000Z

Hello! Can you provide the Terraform plan that led to this situation? I want to reproduce this scenario.

Our Terraform provider only sends the DELETE request and then waits for the resource to be deleted. In the case you described, the resource reached FAILED_DESTROYING state for some reason (API-related) so the resource wasn't deleted and the loop kept going. It's not an endless loop, it's a loop that has a specific timeout.

What I think happened in this scenario (I still need to reproduce this to be sure):

1. terraform destroy sends the first DELETE request;
2. the DELETE request fails, and the API sets the nodepool to FAILED_DESTROYING state;
3. Terraform periodically checks whether the resource has been deleted (the loop), but since the resource is still there, in the FAILED_DESTROYING state, Terraform keeps on checking;
4. you cancel the previous command and run terraform destroy again, which sends another DELETE request; this final DELETE request successfully deletes the nodepool.
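
The wait loop described in these steps can be sketched roughly as follows. This is only an illustration of the behaviour, not the provider's actual code: `wait_until_deleted`, `get_state`, and `FakeClock` are hypothetical names, the timeout and poll interval are made-up values, and "DESTROYING" is an assumed intermediate state name. The point is that a resource stuck in FAILED_DESTROYING is treated like any other "not yet deleted" state, so the loop only ends when the timeout is reached.

```python
import time
from typing import Callable, Optional


def wait_until_deleted(
    get_state: Callable[[], Optional[str]],
    timeout_s: float,
    poll_interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the resource is gone; return the number of polls made.

    get_state() is a hypothetical callback that returns the resource state,
    or None once the resource no longer exists.
    """
    deadline = clock() + timeout_s
    polls = 0
    while clock() < deadline:
        polls += 1
        if get_state() is None:
            return polls
        # Any other state, including FAILED_DESTROYING, just means "keep waiting".
        sleep(poll_interval_s)
    raise TimeoutError(f"resource still present after {polls} polls")


class FakeClock:
    """Simulated time so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


if __name__ == "__main__":
    fake = FakeClock()
    try:
        # The resource never leaves FAILED_DESTROYING, so only the timeout ends the loop.
        wait_until_deleted(lambda: "FAILED_DESTROYING", timeout_s=600,
                           poll_interval_s=10, clock=fake.time, sleep=fake.sleep)
    except TimeoutError as err:
        print(f"gave up at t={fake.now}s: {err}")
```

Cancelling and re-running terraform destroy corresponds to leaving this loop early and sending a fresh DELETE request before polling again.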

Related to the description from Expected behavior:

Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run

As I said above, the provider only sends DELETE requests to the API and waits for the resource to be deleted; the deletion process itself is handled by the API. If the API sets the resource to FAILED_DESTROYING state, it means that something went wrong during the deletion. That has nothing to do with the TF provider, which only sends the requests.

The provider did the job, the DELETE request was sent, the fact that the resource was in FAILED_DESTROYING has nothing to do with the provider. You chose to run terraform destroy again, so basically you sent another DELETE request on a resource that was in FAILED_DESTROYING and somehow it worked, but this is solely related to the API.

Answer 2 · 2024-06-27T13:34:24.000Z

It's not an endless loop, it's a loop that has a specific timeout.

What is the timeout?

The provider did the job, the DELETE request was sent, the fact that the resource was in FAILED_DESTROYING has nothing to do with the provider. You chose to run terraform destroy again, so basically you sent another DELETE request on a resource that was in FAILED_DESTROYING and somehow it worked, but this is solely related to the API.

Mhh, that is passing the buck to and from each other. I think the provider needs to handle API failures appropriately.

For the plan and logs, please refer to internal support ticket 207171709.

Answer 3 · 2024-06-27T14:12:56.000Z

@salyh

What is the timeout?

3 hours, I will also leave a reference to that.

Mhh, that is passing the buck to and from each other. I think the provider needs to handle API failures appropriately.

It really isn't. Besides interrupting the loop when the resource reaches a FAILED_DESTROYING state (I will analyze the implications of this), I don't think anything useful can be implemented. The provider is responsible for taking the data from the tf file and sending it to the API for a create/update command, and for sending a DELETE request for a resource deletion. Let's say that, in the provider, we send a DELETE request again whenever we receive FAILED_DESTROYING. There is no guarantee that the result will be different. Also, how many requests should we send before concluding that something is really not working inside the API?

The DELETE request should be done once and the API should take care of it. The provider only needs to check that the resource is properly deleted before informing the user and reflecting that change in the tf state.

Answer 4 · 2024-06-27T14:51:42.000Z

I disagree - the provider can (and should) indeed send multiple DELETE requests because:

- It wouldn't do any harm
- DELETE is idempotent
- It would deal better with API failures

Answer 5 · 2024-06-27T15:51:00.000Z

@salyh I agree with the points presented above, but sending multiple DELETE requests still doesn't guarantee that your issue will be solved, since we are talking about: "Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run". The resource was in FAILED_DESTROYING. I didn't test this, but from my experience with other resources, if the deletion failed in the first place, it will fail again. I'm not sure how this worked for you; I haven't worked much with this API.

I will discuss this with my colleagues from the API, and if multiple DELETE requests can indeed lead to the deletion of the resource (even if the resource is in FAILED_DESTROYING state), we will implement this mechanism. However, I'd rather see it as a feature (the possibility to define a number of retries for a specific request) than as a bug fix.
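
For illustration, a configurable retry mechanism of the kind discussed here might look like the sketch below. It is not taken from the provider: `delete_with_retries`, `FakeApi`, and all parameters are hypothetical, and the fake API simply assumes that a repeated DELETE can succeed after an earlier one ended in FAILED_DESTROYING, which is what was observed in this issue but is not guaranteed. The upper bound on attempts addresses the question of how many requests to send before giving up.

```python
from typing import Callable, Optional

FAILED = "FAILED_DESTROYING"


def delete_with_retries(
    send_delete: Callable[[], None],
    get_state: Callable[[], Optional[str]],
    max_attempts: int,
    polls_per_attempt: int,
    sleep: Callable[[float], None],
    poll_interval_s: float = 10.0,
) -> int:
    """Send DELETE up to max_attempts times; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        send_delete()
        for _ in range(polls_per_attempt):
            state = get_state()
            if state is None:
                return attempt  # resource is gone
            if state == FAILED:
                break  # this attempt failed; send DELETE again
            sleep(poll_interval_s)
        # Falling through (failed state or polls exhausted) starts the next attempt.
    raise RuntimeError(f"resource not deleted after {max_attempts} DELETE attempts")


class FakeApi:
    """Stand-in API: the first `failures` DELETE calls end in FAILED_DESTROYING."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.deletes_sent = 0

    def send_delete(self) -> None:
        self.deletes_sent += 1

    def get_state(self) -> Optional[str]:
        return FAILED if self.deletes_sent <= self.failures else None


if __name__ == "__main__":
    api = FakeApi(failures=1)
    attempt = delete_with_retries(api.send_delete, api.get_state,
                                  max_attempts=3, polls_per_attempt=5,
                                  sleep=lambda seconds: None)
    print(f"deleted on attempt {attempt}")
```

Because DELETE is idempotent, re-sending it is safe in principle; whether it helps depends entirely on how the API treats a resource that is already in FAILED_DESTROYING.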

Answer 6 · 2024-07-29T07:53:42.000Z

Terraform will throw an error on reaching any of these states, as they are not recoverable.
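
A fail-fast check of this kind could be sketched as below. This is an assumption-laden illustration rather than the provider's implementation: the function and exception names are hypothetical, "DESTROYING" is an assumed intermediate state name, and the set of unrecoverable states contains only FAILED_DESTROYING because that is the only one named in this thread.

```python
from typing import Callable, Optional

# Assumption: only FAILED_DESTROYING is named in this issue; the full set is not listed here.
UNRECOVERABLE_STATES = frozenset({"FAILED_DESTROYING"})


class UnrecoverableStateError(RuntimeError):
    """Raised when the resource enters a state it cannot leave on its own."""


def wait_or_fail_fast(
    get_state: Callable[[], Optional[str]],
    max_polls: int,
    sleep: Callable[[float], None],
    poll_interval_s: float = 10.0,
) -> int:
    """Poll until the resource is gone, but stop at once on an unrecoverable state."""
    for poll in range(1, max_polls + 1):
        state = get_state()
        if state is None:
            return poll  # resource deleted
        if state in UNRECOVERABLE_STATES:
            raise UnrecoverableStateError(f"resource entered {state} on poll {poll}")
        sleep(poll_interval_s)
    raise TimeoutError(f"resource still present after {max_polls} polls")


if __name__ == "__main__":
    states = iter(["DESTROYING", "FAILED_DESTROYING"])
    try:
        wait_or_fail_fast(lambda: next(states), max_polls=100, sleep=lambda seconds: None)
    except UnrecoverableStateError as err:
        print(f"error: {err}")
```

With this behaviour the run ends with an error as soon as the failed state is seen, instead of waiting for the full timeout.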