cloudfoundry/silk-release

silk-daemon drain script times out prematurely

univ0298 opened this issue · 4 comments

Issue

The silk-daemon drain script needs to wait for as long as the rep evacuation timeout has been set to. Instead, the silk-daemon drain is hardcoded to wait for a maximum of 200 seconds (40 loops of 5 seconds). In larger environments where the rep evacuation timeout has to be set above 200 seconds, this causes serious problems because terminating the silk-daemon breaks the cleanup and evacuation process. That in turn leads to the worst possible consequence in Cloud Foundry: application outages for users.

Context

You can see the hardcoded timeouts in the drain code here:
https://github.com/cloudfoundry/silk-release/blob/master/src/silk-daemon-shutdown/main.go#L45-L46 and https://github.com/cloudfoundry/silk-release/blob/master/src/silk-daemon-shutdown/main.go#L80

	fileCheckInterval := flag.Int("containerMetadataFileCheckInterval", 5, "interval (seconds) between checks to the metadata file")
	fileCheckTimeout := flag.Int("containerMetadataFileCheckTimeout", 600, "timeout (seconds) when checking the metadata file")
and
	fileCheckMaxAttempts := 40

This is already a bug in itself, because the 600-second overall timeout is effectively unused: the limit of 40 attempts at a 5-second interval (200 seconds) is reached first.

But neither hardcoded timeout approach is the right way to do this anyway. Instead, the timeout must be aligned with the evacuation timeout that is set on rep, since rep needs silk-daemon to stay alive during that period:
https://github.com/cloudfoundry/diego-release/blob/develop/jobs/rep/spec#L49

  diego.rep.evacuation_timeout_in_seconds:
    description: "The time to wait for evacuation to complete in seconds"
    default: 600

Perhaps a bosh link could be used to get the rep evacuation timeout and pass it to the silk-daemon drain.

As a partial quick fix, the attempts loop could be removed so that only the 600-second timeout applies. That would at least be more generous than the 200 seconds, which is definitely too short for production-scale environments, and the 600 could perhaps be raised a bit further. But this would only be a quick and dirty improvement, not the real fix, which needs to use the rep evacuation timeout.
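As a rough sketch of that quick fix (this is not the actual silk-daemon-shutdown code: the containersRemaining helper, the containerMetadataFile flag, and the default path are simplified stand-ins), the wait loop could be bounded by a single deadline derived from the overall timeout rather than by a fixed attempt count:

```go
package main

import (
	"flag"
	"log"
	"os"
	"time"
)

// containersRemaining is a simplified stand-in for the real check that
// silk-daemon-shutdown performs against the container metadata file.
func containersRemaining(metadataFile string) bool {
	info, err := os.Stat(metadataFile)
	if err != nil {
		return false // no metadata file: nothing left to wait for
	}
	return info.Size() > 2 // e.g. anything more than an empty "{}"
}

func main() {
	fileCheckInterval := flag.Int("containerMetadataFileCheckInterval", 5, "interval (seconds) between checks to the metadata file")
	fileCheckTimeout := flag.Int("containerMetadataFileCheckTimeout", 600, "timeout (seconds) when checking the metadata file")
	// Flag name and default path are illustrative only.
	metadataFile := flag.String("containerMetadataFile", "/var/vcap/data/container-metadata/store.json", "path to the container metadata file")
	flag.Parse()

	// Bound the loop by one deadline computed from the overall timeout,
	// instead of a hardcoded fileCheckMaxAttempts counter.
	deadline := time.Now().Add(time.Duration(*fileCheckTimeout) * time.Second)
	for containersRemaining(*metadataFile) {
		if time.Now().After(deadline) {
			log.Printf("timed out after %d seconds waiting for containers to drain", *fileCheckTimeout)
			break
		}
		time.Sleep(time.Duration(*fileCheckInterval) * time.Second)
	}
	// ...proceed with shutting down silk-daemon here...
}
```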

The issue exists in CFD 16.12 with silk-release 2.36.0.

Steps to Reproduce

1. Set up a system where a cell is quite loaded with apps (say 100 apps on the cell).
2. Configure a correspondingly high rep evacuation_timeout_in_seconds, e.g. 30 minutes (1800 seconds).
3. Initiate a drain on the cell (for example, bosh stop the cell).
4. Watch /var/vcap/sys/log/silk-daemon/drain.log give up and kill silk-daemon after 200 seconds (40 loops).

Expected result

silk-daemon should stay alive and keep waiting on drain until the rep drain has finished (as shown in /var/vcap/sys/log/rep/drain.log, and also as indicated by the container metadata file showing no remaining containers on the cell).

Current result

The evacuation of apps doesn't work properly because silk-daemon is killed prematurely after just 200 seconds.

Possible Fix

Discussed above.

Additional Context

https://cloudfoundry.slack.com/archives/C01ABMVNE9E/p1619198128045700
https://cloudfoundry.slack.com/archives/C01ABMVNE9E/p1619444692046300

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

So it turns out that the latest code actually did address some of the more basic concerns I'd raised (sorry, I looked at older code than I should have).

This commit attempted to increase the overall wait from 300 seconds to 600: 1496631#diff-978cb0b31e667752e9a85a40a5832ada6586760428a69605c5b5d4d3d39bdd44

and then this commit addressed the problem with the max-attempts loop that was limiting the timeout to 40 loops of five seconds, i.e. a 200-second maximum: 4da5176#diff-978cb0b31e667752e9a85a40a5832ada6586760428a69605c5b5d4d3d39bdd44

So after that second fix, things are better than I described above: there is now a 600-second timeout, which matches the default rep drain timeout of 10 minutes.

But the problem arises when someone sets a larger-than-default drain timeout on rep with the diego.rep.evacuation_timeout_in_seconds property. In that case there is still a bug in silk-daemon's drain, because it will time out after 600 seconds even though rep may still be actively draining until evacuation_timeout_in_seconds is reached.

In a nutshell, the remaining bug is that the silk-daemon drain timeout is hardcoded to a 600-second limit, when it should be picking up the rep's evacuation_timeout_in_seconds property and using that as its timeout.
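To illustrate the shape of that fix, here is a minimal sketch: keep the 600-second default, but let the drain wrapper pass in whatever value the deployment uses for diego.rep.evacuation_timeout_in_seconds (for example via a bosh link or a job property rendered into the drain script). The property plumbing shown in the comment is an assumption, not existing silk-release behavior.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Default matches rep's default evacuation timeout (600 s). The drain
	// script would override this with the deployment's actual value of
	// diego.rep.evacuation_timeout_in_seconds, e.g. rendered from a bosh
	// link or job property (assumed plumbing, not current silk-release code):
	//
	//   silk-daemon-shutdown --containerMetadataFileCheckTimeout <rendered evacuation timeout>
	//
	evacuationTimeout := flag.Int("containerMetadataFileCheckTimeout", 600,
		"timeout (seconds) to wait for containers to drain; should match rep's evacuation_timeout_in_seconds")
	flag.Parse()

	deadline := time.Now().Add(time.Duration(*evacuationTimeout) * time.Second)
	fmt.Printf("waiting for container drain until %s\n", deadline.Format(time.RFC3339))
	// ...same deadline-bounded wait loop as in the sketch above...
}
```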

@univ0298 hey, sorry it's taken so long for someone to get back to you on this!

Is this still an issue for you in the current silk-release? Looking through the code + job specs it seems like it likely is, but just wanted to confirm.

Closing due to lack of response and age of the issue, but feel free to re-open or open a new issue if this is still a thing for you!