Endless loop on terraform destroy with nodepool in state "FAILED_DESTROYING"
salyh opened this issue · 6 comments
Description
I provisioned nodepools via Terraform. Once the nodepools are active, I want to destroy them via terraform destroy.
Terraform then runs for more than 30 minutes trying to destroy the nodepool. Looking at the DCD UI, I see that the status is "FAILED_DESTROYING" (see screenshot).
When I then cancel the Terraform run with Ctrl-C and rerun it, the nodepool is actually destroyed within a few seconds.
Expected behavior
Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run
Environment
Terraform version:
OpenTofu v1.7.2
Provider version:
v6.4.17
OS:
References
Hello! Can you provide the Terraform plan that led to this situation? I want to reproduce this scenario.
Our Terraform provider only sends the DELETE request and then waits for the resource to be deleted. In the case you described, the resource reached the FAILED_DESTROYING state for some (API-related) reason, so the resource wasn't deleted and the loop kept going. It's not an endless loop, it's a loop that has a specific timeout.
What I think happened in this scenario (I still need to reproduce this to be sure):
- terraform destroy sends the first DELETE request;
- the DELETE request fails, and the API sets the nodepool to the FAILED_DESTROYING state;
- Terraform periodically checks the deletion of the resource (the loop), but since the resource is always there, in the FAILED_DESTROYING state, Terraform keeps on checking;
- you cancel the previous command and then run terraform destroy again, which sends another DELETE request; this final DELETE request successfully deletes the nodepool.
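The sequence above can be sketched as a small simulation. This is not the provider's actual code: FakeApi, wait_for_deletion, destroy, and the 30-second poll interval are hypothetical names and values chosen for illustration, and the clock is simulated so the sketch finishes instantly. The 3-hour default is the timeout figure given later in this thread.

```python
class FakeApi:
    """Hypothetical stand-in for the cloud API; not the real client."""

    def __init__(self):
        self.state = "ACTIVE"
        self.fail_next_delete = True  # assumption: the first DELETE fails
        self.delete_calls = 0

    def delete(self):
        self.delete_calls += 1
        if self.fail_next_delete:
            # The API could not delete the nodepool and parks it here.
            self.fail_next_delete = False
            self.state = "FAILED_DESTROYING"
        else:
            self.state = "DELETED"

    def exists(self):
        return self.state != "DELETED"


def wait_for_deletion(api, timeout_s, poll_interval_s):
    """Poll until the resource is gone or the timeout elapses.

    Time is only counted, never slept, so this runs instantly.
    """
    elapsed = 0
    while elapsed < timeout_s:
        if not api.exists():
            return True
        elapsed += poll_interval_s
    return False


def destroy(api, timeout_s=3 * 60 * 60, poll_interval_s=30):
    """One destroy run: send DELETE once, then only poll."""
    api.delete()
    return wait_for_deletion(api, timeout_s, poll_interval_s)
```

Under these assumptions, the first destroy(api) call returns False after the full simulated timeout, because the nodepool stays in FAILED_DESTROYING and nothing re-sends the request. A second destroy(api) call sends a new DELETE and returns True straight away, which matches the behavior reported in the description.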
Related to the description from Expected behavior:
Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run
As I said above, the provider only sends DELETE requests to the API and waits for the resource to be deleted; the deletion process is handled by the API. If the API sets the resource to the FAILED_DESTROYING state, it means that something went wrong during the deletion. It has nothing to do with the TF provider; the provider only sends the requests.
The provider did the job, the DELETE request was sent, the fact that the resource was in FAILED_DESTROYING has nothing to do with the provider. You choose to run terraform destroy again, so basically you sent another DELETE request on a resource that was in FAILED_DESTROYING and somehow it worked, but this is solely related to the API.
It's not an endless loop, it's a loop that has a specific timeout.
What is the timeout?
The provider did the job, the DELETE request was sent, the fact that the resource was in FAILED_DESTROYING has nothing to do with the provider. You choose to run terraform destroy again, so basically you sent another DELETE request on a resource that was in FAILED_DESTROYING and somehow it worked, but this is solely related to the API.
Mhh, that is passing the buck to and from each other. I think the provider needs to handle API failures appropriately.
For the plan and logs, please refer to internal support ticket 207171709.
What is the timeout?
3 hours, I will also leave a reference to that.
Mhh, that is passing the buck to and from each other. I think the provider needs to handle API failures appropriately.
It really isn't. Besides interrupting the loop when the resource reaches a FAILED_DESTROYING state (I will analyze the implications of this), I don't think that anything useful can be implemented. The provider is responsible for taking the data from the tf file and sending it to the API in case of a create/update command, and for sending a DELETE request in case of a resource deletion. Let's say that, in the provider, when we receive a FAILED_DESTROYING we send a DELETE request again. There is no guarantee that the result will be different. Also, how many requests should we send before understanding that something is really not working inside the API?
The DELETE request should be done once and the API should take care of it. The provider only needs to check that the resource is properly deleted before informing the user and reflecting that change in the tf state.
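The one change considered above, interrupting the loop as soon as the resource reaches FAILED_DESTROYING, could look roughly like the following. This is a minimal sketch, not the provider's code: DeletionFailed, TERMINAL_FAILURE_STATES, wait_for_deletion, and the get_state callback (assumed to return the current state string, or None once the resource is gone) are all hypothetical.

```python
import time


class DeletionFailed(Exception):
    """Raised when the API reports a state that polling cannot fix."""


# Assumption: only the state named in this thread is treated as terminal.
TERMINAL_FAILURE_STATES = {"FAILED_DESTROYING"}


def wait_for_deletion(get_state, timeout_s, poll_interval_s, sleep=time.sleep):
    """Poll until the resource is gone, failing fast on a terminal state.

    get_state is a hypothetical callback returning the resource state,
    or None once the resource no longer exists.
    """
    elapsed = 0
    while elapsed < timeout_s:
        state = get_state()
        if state is None:
            return  # resource is gone, deletion confirmed
        if state in TERMINAL_FAILURE_STATES:
            # Waiting longer cannot change the outcome, so stop now
            # instead of polling until the timeout.
            raise DeletionFailed(
                f"resource entered {state} after {elapsed}s of polling"
            )
        sleep(poll_interval_s)
        elapsed += poll_interval_s
    raise TimeoutError(f"resource still present after {timeout_s}s")
```

With this shape, the user would get an error within one poll interval of the failure rather than after the full timeout, and could then decide whether to re-run terraform destroy.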
I disagree - the provider can (and should) indeed send multiple DELETE requests because:
- It wouldn't do any harm
- DELETE is idempotent
- It would deal better with API failures
@salyh I agree with the points presented above, but still, sending multiple DELETE requests doesn't guarantee that your issue will be solved, since we are talking about: "Properly destroy nodepools in state "FAILED_DESTROYING" without the need of a terraform destroy re-run". The resource was in FAILED_DESTROYING. I didn't test this, but from my experience with other resources, if the deletion failed in the first place, it will fail again. I'm not sure how this worked for you; I haven't worked much with this API.
I will discuss this with my colleagues who work on the API, and if multiple DELETE requests can indeed lead to the deletion of the resource (even if the resource is in the FAILED_DESTROYING state), we will implement this mechanism, but I'd rather see it as a feature (having the possibility to define a number of retries for a specific request) instead of a bug fix.
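The retry feature floated above (a configurable number of retries for a request) might be sketched as follows. Nothing here reflects an existing provider option: delete_with_retries, max_retries, polls_per_attempt, and the send_delete/get_state callbacks are hypothetical, and the sketch assumes that re-sending DELETE to a resource in FAILED_DESTROYING is safe, which is exactly the point still to be confirmed with the API.

```python
def delete_with_retries(send_delete, get_state, max_retries, polls_per_attempt):
    """Send DELETE and poll; re-send only if the API reports FAILED_DESTROYING.

    send_delete and get_state are hypothetical callbacks. get_state returns
    the resource state, or None once the resource is gone.
    Returns the number of DELETE requests that were sent.
    """
    attempts = 0
    while attempts <= max_retries:
        send_delete()
        attempts += 1
        for _ in range(polls_per_attempt):
            state = get_state()
            if state is None:
                return attempts  # deletion confirmed
            if state == "FAILED_DESTROYING":
                break  # stop polling; the outer loop decides on a retry
        else:
            # Polling budget used up without a failure state: treat as timeout.
            raise TimeoutError("resource still present after polling budget")
    # The retry budget is bounded so a persistent API problem surfaces
    # as an error instead of an open-ended stream of DELETE requests.
    raise RuntimeError(f"nodepool not deleted after {attempts} DELETE request(s)")
```

Bounding the retries answers the "how many requests should we send" question with a user-chosen number: with max_retries=0 this behaves like a single DELETE that fails fast, and higher values opt in to the re-send behavior requested in this issue.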
Terraform will throw an error on reaching any of these states, as they are not recoverable.