leg100/otf

Terraform state not updated on a failed run

Closed this issue · 8 comments

I have a run with a series of resources, and the plan finishes OK.
During the apply, some of the resources are created, but the run fails mid-execution due to a problem that prevents a pod from being created.
The run then fails like this:

Terraform has been successfully initialized!
module.oss_prod_sonarqube_12.module.postgresql.kubernetes_secret.this: Creating...
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Creating...
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Creating...
module.oss_prod_sonarqube_12.aws_route53_record.hostname[0]: Creating...
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Still creating... [10s elapsed]
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Still creating... [10s elapsed]
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Creation complete after 11s [id=vol-0b79209ea04d62be9]
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Creation complete after 11s [id=vol-04889688362e7f912]
module.oss_prod_sonarqube_12.kubernetes_persistent_volume.this: Creating...
module.oss_prod_sonarqube_12.module.postgresql.kubernetes_persistent_volume.this: Creating...
╷
│ Error: persistentvolumes "sonarqube-12-postgresql" already exists
│
│   with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_persistent_volume.this,
│   on .terraform/modules/oss_prod_sonarqube_12.postgresql/local_volume.tf line 24, in resource "kubernetes_persistent_volume" "this":
│   24: resource "kubernetes_persistent_volume" "this" {
│
╵
╷
│ Error: secrets "sonarqube-12-postgresql-credentials" already exists
│
│   with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_secret.this,
│   on .terraform/modules/oss_prod_sonarqube_12.postgresql/main.tf line 24, in resource "kubernetes_secret" "this":
│   24: resource "kubernetes_secret" "this" {
│
╵
╷
│ Error: creating Route 53 Record: InvalidChangeBatch: [Tried to create resource record set [name='sonarqube-12.oss.ossy.dev.', type='CNAME'] but it already exists]
│ 	status code: 400, request id: e4654ae8-c869-462f-9b74-f4f2bd742b95
│
│   with module.oss_prod_sonarqube_12.aws_route53_record.hostname[0],
│   on .terraform/modules/oss_prod_sonarqube_12/dns.tf line 5, in resource "aws_route53_record" "hostname":
│    5: resource "aws_route53_record" "hostname" {
│
╵
╷
│ Error: persistentvolumes "sonarqube-12" already exists
│
│   with module.oss_prod_sonarqube_12.kubernetes_persistent_volume.this,
│   on .terraform/modules/oss_prod_sonarqube_12/local_volume.tf line 30, in resource "kubernetes_persistent_volume" "this":
│   30: resource "kubernetes_persistent_volume" "this" {
│
╵
 
Error: exit status 1: Error: persistentvolumes "sonarqube-12-postgresql" already exists with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_persistent_volume.this, on .terraform/modules/oss_prod_sonarqube_12.postgresql/local_volume.tf line 24, in resource "kubernetes_persistent_volume" "this": 24: resource "kubernetes_persistent_volume" "this" { Error: secrets "sonarqube-12-postgresql-credentials" already exists with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_secret.this, on .terraform/modules/oss_prod_sonarqube_12.postgresql/main.tf line 24, in resource "kubernetes_secret" "this": 24: resource "kubernetes_secret" "this" { Error: creating Route 53 Record: InvalidChangeBatch: [Tried to create resource record set [name='sonarqube-12.oss.ossy.dev.', type='CNAME'] but it already exists] status code: 400, request id: e4654ae8-c869-462f-9b74-f4f2bd742b95 with module.oss_prod_sonarqube_12.aws_route53_record.hostname[0], on .terraform/modules/oss_prod_sonarqube_12/dns.tf line 5, in resource "aws_route53_record" "hostname": 5: resource "aws_route53_record" "hostname" { Error: persistentvolumes "sonarqube-12" already exists with module.oss_prod_sonarqube_12.kubernetes_persistent_volume.this, on .terraform/modules/oss_prod_sonarqube_12/local_volume.tf line 30, in resource "kubernetes_persistent_volume" "this": 30: resource "kubernetes_persistent_volume" "this" {

The problem is that, in case of a failure, the state file is not updated, so the resources created by this run are not under Terraform control.
If I decide to delete the workspace, any objects created during the failed run are ignored and left in the account.

So, in case of a failed run, the state file should be updated with references to any resource created during the run.

This behavior differs from Terraform Cloud, which creates multiple state files during a single run and keeps all the created objects in the state.

leg100 commented

This behavior differs from Terraform Cloud, which creates multiple state files during a single run and keeps all the created objects in the state.

It looks like this functionality - not specific to Terraform Cloud - was added only recently, in v1.5.0:

https://github.com/hashicorp/terraform/blob/v1.5/CHANGELOG.md#150-june-12-2023

Terraform CLI's local operations mode will now attempt to persist state snapshots to the state storage backend periodically during the apply step, thereby reducing the window for lost data if the Terraform process is aborted unexpectedly. (hashicorp/terraform#32680)

OTF currently runs terraform apply and, if it fails, skips uploading the state file. I'll change the behaviour so that the state file is uploaded regardless of whether the apply failed or succeeded.

Thank you for bringing this to my attention.

Hi @leg100

I have little experience with Golang, so please let me see if I understand this correctly.

The worker finishes the phase and, regardless of the outcome, invokes FinishPhase, but it sets "Errored" in the finishOptions if there was an error.

	// Regardless of success, mark phase as finished
	_, err = w.FinishPhase(ctx, r.ID, r.Phase(), finishOptions)
	if err != nil {
		log.Error(err, "finishing phase")
		return
	}
}

In service.FinishPhase, it creates a report for the output (if not Errored) and then updates the state in the DB. It still passes a reference to the error.

otf/internal/run/service.go

Lines 324 to 326 in 6b9e6b1

run, err := s.db.UpdateStatus(ctx, runID, func(run *Run) error {
	return run.Finish(phase, opts)
})

In the DB updateStatus method, it checks whether the run errored, and if it did, it won't update the status.

otf/internal/run/db.go

Lines 113 to 115 in 6b9e6b1

if err := fn(run); err != nil {
	return err
}

So if I remove those three lines in db.go, it should continue saving the state even in case of an error, which is what this issue is about.

Is that right?

leg100 commented

@Spritekin That part of the code is concerned with updating the run status, not the Terraform state.

See the fix I've made here:

https://github.com/leg100/otf/pull/539/files

If you read ./internal/agent/steps.go from the top, you'll get an understanding of what's going on. Each run is composed of many steps, which are processed sequentially. If one step fails, processing stops and later steps are skipped.

The problem in this case is that the terraform apply step fails, so the next and last step, which uploads the state, is skipped.

The fix is to combine the two steps into one. In Go, a deferred call runs at the end of a function regardless of any error, so that is what is used to ensure any change to the state is uploaded regardless of what happens to the apply.
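
As a rough sketch of the idea (hypothetical names only, not the actual code in steps.go or the PR; step, runSteps, applyAndUploadState and uploadState are stand-ins for OTF's real types and steps):

package agent

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
)

type step func(context.Context) error

// runSteps processes each step in order, stopping at the first failure,
// so any steps after a failed one are skipped.
func runSteps(ctx context.Context, steps []step) error {
	for _, s := range steps {
		if err := s(ctx); err != nil {
			return err
		}
	}
	return nil
}

// applyAndUploadState combines "terraform apply" and "upload state" into a
// single step. The deferred upload runs even when the apply returns an error.
func applyAndUploadState(workdir string) step {
	return func(ctx context.Context) (err error) {
		defer func() {
			// uploadState is a hypothetical stand-in for OTF's real upload.
			if uerr := uploadState(ctx, workdir); uerr != nil {
				err = errors.Join(err, fmt.Errorf("uploading state: %w", uerr))
			}
		}()

		cmd := exec.CommandContext(ctx, "terraform", "apply", "-input=false")
		cmd.Dir = workdir
		return cmd.Run()
	}
}

func uploadState(ctx context.Context, workdir string) error {
	// ...read the state file from workdir and send it to the server...
	return nil
}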

leg100 commented

@Spritekin It would be appreciated if you could give the new release a test run, and even better if you could replicate the error scenario above, to see whether it successfully persists partial state updates.

Sure, I will do that now... I'm rebuilding my images with AWS CLI support.

https://hub.docker.com/r/spritekin/otfd

Now I can't replicate the failure... don't worry, I will build a new module soon; new modules usually fail during tests. Will keep you posted.

pat-s commented

@leg100 maybe it would be helpful to list custom images like this one or mine in the README, to prevent other users from building duplicate ones unnecessarily?

@leg100 I'm still testing, but it's looking good. I say this because in a first run a series of objects were created and then my module failed... the apply phase looks like this:

Terraform has been successfully initialized!
module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Creation complete after 2s [id=n1/penpot-16-postgresql-credentials]
module.oss_prod_penpot_16.module.redis.kubernetes_secret.this: Creation complete after 2s [id=n1/penpot-16-redis-credentials]
module.oss_prod_penpot_16.module.postgresql.aws_ebs_volume.this: Creation complete after 10s [id=vol-01aa843277aca7c3b]
module.oss_prod_penpot_16.aws_ebs_volume.this: Creation complete after 10s [id=vol-08e4a387cf4a974d9]
module.oss_prod_penpot_16.module.redis.aws_ebs_volume.this: Creation complete after 10s [id=vol-0e7cd4a17438f3929]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume.this: Creation complete after 1s [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume.this: Creation complete after 1s [id=penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume_claim.this: Creation complete after 0s [id=n1/penpot-16-postgresql]
module.oss_prod_penpot_16.kubernetes_persistent_volume_claim.this: Creation complete after 0s [id=n1/penpot-16]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume_claim.this: Creation complete after 12s [id=n1/penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Creation complete after 23s [id=penpot-16-postgresql]
module.oss_prod_penpot_16.aws_route53_record.hostname[0]: Creation complete after 47s [id=Z05392673CT6X22PUH929_penpot-16.oss.ossy.dev_CNAME]
module.oss_prod_penpot_16.module.redis.helm_release.redis: Creation complete after 41s [id=penpot-16-redis]
╷
│ Error: template: penpot/templates/frontend/ingress.yaml:25:19: executing "penpot/templates/frontend/ingress.yaml" at <.Values.ingress.tls>: nil pointer evaluating interface {}.tls
│
│   with module.oss_prod_penpot_16.helm_release.application,
│   on .terraform/modules/oss_prod_penpot_16/main.tf line 272, in resource "helm_release" "application":
│  272: resource "helm_release" "application" {
│

But then if I try to rerun, the plan already recognises the created objects and tries to refresh them:

module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Refreshing state... [id=n1/penpot-16-postgresql-credentials]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume.this: Refreshing state... [id=penpot-16-redis]
module.oss_prod_penpot_16.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Refreshing state... [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.helm_release.redis: Refreshing state... [id=penpot-16-redis]
 

This wouldn't happen before; it would try to recreate all the objects and fail because they already existed.

Then I tried deleting the stack and the objects were deleted correctly.

module.oss_prod_penpot_16.module.redis.helm_release.redis: Destroying... [id=penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Destroying... [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_secret.this: Destroying... [id=n1/penpot-16-redis-credentials]
module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Destroying... [id=n1/penpot-16-postgresql-credentials]

So I think this feature is good. Thank you so much!