hashicorp/terraform-provider-vault

[Bug]: vault_jwt_auth_backend not added to state if error during configuration

nmasur opened this issue · 3 comments

Terraform Core Version

1.8.0

Terraform Vault Provider Version

3.23.0

Vault Server Version

1.15.4+ent

Affected Resource(s)

  • vault_jwt_auth_backend

Expected Behavior

If there is an error configuring the backend during apply/creation (such as "error checking oidc discovery URL") then either one of the following should take place:

  • The backend should remain and be recorded in state, while the apply is considered to have failed.
  • Or the backend should be removed completely, with the apply still reported as failed.

Actual Behavior

If there is an error because the OIDC URL is unreachable (due to a firewall block, say), you get the following message:

Error: error updating configuration to Vault for path myjwtbackend: Error making API request.

Namespace: mynamespace
URL: PUT https://vault.mycorp.com/v1/auth/myjwtbackend/config
Code: 400. Errors:

* error checking oidc discovery URL

However, the resource is not cleaned up, nor is it added to state. This means that the Terraform provider has left the auth backend dangling on the Vault server. If you try to run it again, you'll now see this error:

* path is already in use at myjwtbackend/

This means that somebody has to go into Vault and manually clean up the dangling resource. Ideally, the backend should be added to state even if configuration fails; alternatively, it could be rolled back (deleted) on failure.
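Until that happens, the only workaround is manual cleanup. A minimal sketch, assuming the operator has permission to disable auth methods in that namespace (path and namespace taken from the error output above):

```shell
# Disable (delete) the dangling auth mount so Terraform can recreate it
VAULT_NAMESPACE=mynamespace vault auth disable myjwtbackend
```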

Relevant Error/Panic Output Snippet

vault_jwt_auth_backend.kubernetes["myjwtbackend"]: Creating...
╷
│ Error: error writing to Vault: Error making API request.
│ 
│ Namespace: mynamespace
│ URL: POST https://vault.mycorp.com/v1/sys/auth/myjwtbackend
│ Code: 400. Errors:
│ 
│ * path is already in use at myjwtbackend/
│ 
│   with vault_jwt_auth_backend.kubernetes["myjwtbackend"],
│   on policies_handler.tf line 315, in resource "vault_jwt_auth_backend" "kubernetes":
│  315: resource "vault_jwt_auth_backend" "kubernetes" {
│ 
╵
Error: Terraform exited with code 1.
Error: Process completed with exit code 1.

Terraform Configuration Files

locals {
  all_kubernetes_clusters = {
    myjwtbackend = {
      url = "https://some.invalid.url:6300"
    }
  }
}

resource "vault_jwt_auth_backend" "kubernetes" {
  for_each           = local.all_kubernetes_clusters
  description        = "Kubernetes cluster for ${each.key}"
  path               = each.key
  oidc_discovery_url = each.value.url
  bound_issuer       = each.value.url
}

Steps to Reproduce

  1. Add a vault_jwt_auth_backend resource where the OIDC discovery URL is unreachable.
  2. Run terraform apply and observe the error creating the resource.
  3. Run terraform apply again and observe the "path is already in use" error.

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

I'm also running into this, and it's proving very challenging to resolve, since the auth method mount exists in Vault but its configuration does not. My hope was to import the auth method and then mark it tainted to force a replacement, but the import fails as well, because the provider tries to read the auth method configuration during import.
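For reference, the failed import attempt looks roughly like this (resource address and path taken from the configuration above):

```shell
# Attempt to import the dangling mount so it can be tainted and replaced —
# this fails because the provider reads auth/myjwtbackend/config during import
terraform import 'vault_jwt_auth_backend.kubernetes["myjwtbackend"]' myjwtbackend
```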

This doesn't seem entirely like an issue with the Terraform provider; the underlying Vault API flow creates the auth mount successfully and then fails to write the configuration, leaving the mount dangling. Fixing that would seem to be more invasive than updating the Terraform provider to consider the auth backend created but tainted.

I think the issue is that the Terraform provider has to make two underlying API calls:

  1. Enable the auth method (/sys/auth/:path).
  2. Configure the auth method (/auth/:path/config).

When step 1 succeeds but step 2 fails, the auth method from step 1 still exists. For the provider to roll back step 1, it would need to delete the auth method (which it might not even have permission to do in some circumstances). It would be better to add the resource to state, either tainted or as created with failures.
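The two calls map to these CLI equivalents (sketch only, with the URL taken from the example configuration):

```shell
# 1. Enable the auth method — succeeds, creating the mount at auth/myjwtbackend
vault auth enable -path=myjwtbackend jwt

# 2. Write its configuration — fails with "error checking oidc discovery URL",
#    leaving the mount from step 1 dangling
vault write auth/myjwtbackend/config \
    oidc_discovery_url="https://some.invalid.url:6300" \
    bound_issuer="https://some.invalid.url:6300"
```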

With that in mind, for the mount to be importable on its own, the provider would need to separate the auth method from its configuration. If the provider expected you to use a generic auth backend resource separately from an auth backend configuration resource, that would follow the Vault API more accurately. It would also mean that every auth backend requires two resources, but it would be less confusing overall.
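A hypothetical sketch of that split — `vault_auth_backend` exists in the provider today, but `vault_jwt_auth_backend_config` is an invented name for illustration only:

```hcl
# Real resource: just mounts the auth method, mirroring /sys/auth/:path
resource "vault_auth_backend" "kubernetes" {
  type = "jwt"
  path = "myjwtbackend"
}

# Hypothetical resource: configuration split out on its own, mirroring
# /auth/:path/config — a failure here would leave the mount above in state
resource "vault_jwt_auth_backend_config" "kubernetes" {
  backend            = vault_auth_backend.kubernetes.path
  oidc_discovery_url = "https://some.invalid.url:6300"
}
```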

We're also running into this problem. It makes our Terraform configuration brittle: we have to add checks beforehand to make sure the resource creation will succeed, but since not everything is under our control, we can still hit the error anyway.

What would make the problem less severe for us would be a timeout parameter and an automatic retry until the whole resource creation succeeds. But even then, if the timeout is reached, the backend should still be added to state, so I'm not sure retries alone would solve it.