cloudfoundry-incubator/quarks-operator

Please support drain scripts

Closed this issue · 5 comments

Is your feature request related to a problem? Please describe.
BOSH supports drain scripts, and I'd like to use one for some kubecf work (to dynamically create and remove application security group rules to support credhub).

Describe the solution you'd like
If a job script /var/vcap/jobs/…/bin/drain exists, I'd like it to be executed on pod termination.

I don't have an opinion on whether this should be triggered on SIGTERM of container-run or as a preStop hook on the container.
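For illustration, the preStop option could look roughly like the Go sketch below. It is an assumption-laden example, not the operator's implementation: the structs are local stand-ins that only mirror the Kubernetes field names `lifecycle.preStop.exec.command`, `preStopForDrain` is a hypothetical helper, and the job name `example` is made up because the path above is elided.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins mirroring the Kubernetes container field names
// lifecycle.preStop.exec.command. These are NOT the real client types.
type execAction struct {
	Command []string `json:"command"`
}

type handler struct {
	Exec execAction `json:"exec"`
}

type lifecycle struct {
	PreStop handler `json:"preStop"`
}

// preStopForDrain (hypothetical helper) builds a preStop hook that runs the
// drain script only when it exists and is executable. The path is passed as
// a positional argument ($1) so it needs no shell quoting.
func preStopForDrain(drainPath string) lifecycle {
	script := `if [ -x "$1" ]; then exec "$1"; fi`
	return lifecycle{PreStop: handler{Exec: execAction{
		Command: []string{"/bin/sh", "-c", script, "drain", drainPath},
	}}}
}

func main() {
	// "example" is a made-up job name used only for this sketch.
	lc := preStopForDrain("/var/vcap/jobs/example/bin/drain")
	out, err := json.MarshalIndent(map[string]lifecycle{"lifecycle": lc}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Running it prints the `lifecycle` stanza as JSON, which is the shape a container spec would carry if the hook route were chosen.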

Describe alternatives you've considered
I tried adding an extra job whose run script triggers the drain script manually on shutdown. (Draining is idempotent in this case anyway.) The trigger fires correctly, but the pod goes away before the script can finish.

I don't see anything in the docs that describes a timeout for shutdown. (The docs do reference hooks, but with a comment that they are not implemented.)
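As general Kubernetes background (not confirmed as the cause in this issue): a pod gets `terminationGracePeriodSeconds`, 30 seconds by default, between SIGTERM and SIGKILL, and time spent in a preStop hook counts against that budget. The Go sketch below only imitates such a cutoff locally; `runWithGracePeriod` is a hypothetical helper, and `sleep 3` stands in for a slow drain script.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithGracePeriod (hypothetical helper) runs a command and kills it once
// graceSeconds have elapsed, loosely imitating a container being killed when
// its grace period runs out. It reports whether the deadline was hit.
func runWithGracePeriod(graceSeconds int, name string, args ...string) (bool, error) {
	budget := time.Duration(graceSeconds) * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	// CommandContext kills the process when the context deadline expires.
	err := exec.CommandContext(ctx, name, args...).Run()
	timedOut := errors.Is(ctx.Err(), context.DeadlineExceeded)
	return timedOut, err
}

func main() {
	// "sleep 3" stands in for a slow drain; 1s stands in for the grace period.
	timedOut, err := runWithGracePeriod(1, "sleep", "3")
	fmt.Printf("timed out: %v, error: %v\n", timedOut, err)
}
```

If a drain script behaves like the `sleep` here, raising the grace period or shortening the drain would be the two obvious levers to try.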

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/171870504

The labels on this github issue will be updated when the story is started.

@mook-as we have some tests for drain scripts (they should be working)
https://github.com/cloudfoundry-incubator/cf-operator/blob/3eafb4ed26e316fe32ea3b2e0d29afea756a4ff7/integration/lifecycle_test.go#L87

Do you have a sample or maybe a kubecf branch where you tried this out?

I'm currently working on mook-as/kubecf/credhub-sec-group-scf-helper + mook-as/scf-helper-release/kubecf/credhub-asgs; the relevant job is credhub-setup.

I added a temporary BOSH property, credhub_setup.use_drain, which, if set to false, uses the workaround mentioned above of manually triggering the drain script on exit of the main run script. Neither approach works: with the workaround in place, you can see in the credhub-setup job (in either the uaa or the credhub group) the drain exiting halfway; without the workaround, the drain doesn't even get triggered. Either way, the expected behaviour on drain is that the cf security-group corresponding to the pod gets removed.

The most reliable way to test is probably:

  1. Deploy with sizing.credhub.instances = 2.
  2. Wait for cf security-groups to show that the credhub-internal-kubecf-credhub-1 group has been created.
  3. Scale credhub down to 1.
  4. Note that the logs don't show the drain being run (it should).
  5. Check that the credhub-internal-kubecf-credhub-1 security group still exists (it shouldn't).

It is possible that I just have a bug somewhere in my code, but I at least expected some output from my code.

manno commented

@mook-as I can't comment on your workaround. Regarding the existing drain script support in the operator, I think this is the relevant code segment: https://github.com/cloudfoundry-incubator/cf-operator/blob/fb39a29ad746849d8d6cc4177f54a8ee6357dfe8/pkg/bosh/bpmconverter/container_factory.go#L552

Apparently we support multiple drain scripts in a directory named 'drain', which would explain why your script wasn't executed. Could you try putting your script(s) in /var/vcap/jobs/…/bin/drain/script1.sh?
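To make the directory layout concrete, here is a minimal Go sketch of discovering scripts inside a `drain` directory. It is not the operator's code (the linked line may do this differently); `drainScripts` and `demoLayout` are hypothetical helpers, and the demo uses a temporary directory rather than a real job path.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// drainScripts (hypothetical helper) returns the regular files directly
// inside drainDir. os.ReadDir returns entries sorted by filename. If
// drainDir is a plain file rather than a directory, ReadDir fails, so a
// single file named "drain" would not be picked up by this sketch.
func drainScripts(drainDir string) ([]string, error) {
	entries, err := os.ReadDir(drainDir)
	if err != nil {
		return nil, err
	}
	var scripts []string
	for _, e := range entries {
		if e.Type().IsRegular() {
			scripts = append(scripts, filepath.Join(drainDir, e.Name()))
		}
	}
	return scripts, nil
}

// demoLayout creates a throwaway bin/drain directory containing two scripts
// and one subdirectory (which is skipped), and returns the drain directory.
func demoLayout() (string, error) {
	root, err := os.MkdirTemp("", "drain-demo")
	if err != nil {
		return "", err
	}
	drainDir := filepath.Join(root, "bin", "drain")
	if err := os.MkdirAll(filepath.Join(drainDir, "ignored-subdir"), 0o755); err != nil {
		return "", err
	}
	for _, name := range []string{"script1.sh", "script2.sh"} {
		body := []byte("#!/bin/sh\necho draining\n")
		if err := os.WriteFile(filepath.Join(drainDir, name), body, 0o755); err != nil {
			return "", err
		}
	}
	return drainDir, nil
}

func main() {
	drainDir, err := demoLayout()
	if err != nil {
		panic(err)
	}
	// Remove the temporary root (two levels above bin/drain) when done.
	defer os.RemoveAll(filepath.Dir(filepath.Dir(drainDir)))

	scripts, err := drainScripts(drainDir)
	if err != nil {
		panic(err)
	}
	for _, s := range scripts {
		fmt.Println(s)
	}
}
```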

manno commented

ping @mook-as
I'm going to close this :)