ljfranklin/terraform-resource

Timeout no longer seems to work with Concourse v7.5.0

blgm opened this issue · 2 comments

blgm commented

Our pipeline has historically had a timeout on a Terraform apply/PUT because sometimes our backend never finishes. Since we updated to Concourse v7.5.0 we have noticed that the timeout is no longer respected and the task runs forever. But other tasks in Concourse v7.5 do timeout correctly. The thinking is that there may have been some subtle changes in signal handling in Concourse v7.5.0, and the SIGTERM is sent to the put process and not propagated to the terraform CLI process. I'm not in a position revert to a previous version of Concourse, so I can't be 100% sure that the issue I'm seeing relates to Concourse and not to another component. I'm currently cleaning up by intercepting the containers and killing the terraform processes myself.

I think it would be more robust if the put process were to handle SIGTERM and programatically propagate it to the terraform process (or maybe the process group). But I'm mostly raising this in case anyone else has seen the same issue or also uses this resource with a timeout.

@blgm I pushed 6472639 which hopefully does a better job at forwarding the signals to the child terraform process. Can you try the timeout test again with ljfranklin/terraform-resource:latest or ljfranklin/terraform-resource:1.0.8?

That said, this feels like a regression in Concourse. A timeout feature shouldn't require that all resources respond perfectly to signals. Ideally Concourse would send a SIGTERM to allow for graceful shutdown then a few seconds later would force kill the container. Can you open an issue with the Concourse team to follow up on this?

blgm commented

Hi @ljfranklin, sorry it's taken a while to respond.

  1. I agree that it does look like a regression in Concourse. I've talked to the Concourse team and they think Concourse is doing the right thing. Also I've done some testing with dummy pipelines and Concourse does seem to be sending a signal to every process in the container. So while I'm highly suspicious of Concourse, I'm not in a position to raise a helpful issue as I can't explain what's wrong. For example, the following pipeline times-out just fine:
jobs:
- name: snooze
  plan:
  - config:
      image_resource:
        name: ""
        source:
          repository: ubuntu
          tag: latest
        type: docker-image
      platform: linux
      run:
        args:
        - -euxc
        - |
            trap "echo EXITING PARENT" EXIT
            bash -c 'trap "echo EXITING CHILD" EXIT;sleep 300'
        path: bash
    task: sleep
    timeout: 30s
  1. Thank you for commit 6472639. I have tried it out, and it doesn't seem to make any difference. Also, looking at the code it seems to be some handling of errors finding terraform in the PATH, but it's not got much relating to signals. Perhaps there's been a miscommunication, or perhaps the wrong change was committed. It seems like an improvement, so it's worth keeping.

  2. I've had a go at a fix myself in my fork which I'm still testing. The idea is that when terraform is run, it's run in a process group, and if the parent process receives a signal then it will end the whole process group (including any child processes). Maybe this change is very similar to one that you've tried. I'll let you know how the testing goes.