cloudfoundry/bosh

[Feature Request] External CPI calls should eventually timeout

mrosecrance opened this issue · 7 comments

Is your feature request related to a problem? Please describe.
There are a few reasons CPI calls may hang. A call between the CPI and the IaaS may hit network issues, or the initial connection from the CPI to the IaaS may fail. As a result, the BOSH task backing the call hangs indefinitely in a "running" state until it is manually cancelled. There's also a small chance the director -> CPI communication channel has an error: it uses Open3, which spawns a process to invoke the CPI binary, and once upon a time there were deadlock issues there.

Describe the solution you'd like
There should be a -very long- timeout on BOSH's side for CPI calls (hours? we should probably look into what the most expensive calls could be). Having tasks fail with an error message prompting users to look at the CPI logs would help. It's also likely that if the failure is a flake, users can deploy their way forward with minimal disruption.
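
A minimal sketch of what such a safeguard could look like on the director side, assuming a hypothetical run_cpi_command helper around the existing Open3 call; the timeout value and error message are made up:

```ruby
require 'open3'
require 'timeout'

# Hypothetical director-side wrapper around the existing Open3-based CPI call.
# If the CPI process does not return within the deadline, the task fails with
# an error pointing operators at the CPI logs instead of hanging forever.
CPI_CALL_TIMEOUT = 3 * 60 * 60 # "very long"; the exact value is still open

def run_cpi_command(cmd, request_json)
  Timeout.timeout(CPI_CALL_TIMEOUT) do
    stdout, stderr, status = Open3.capture3(cmd, stdin_data: request_json)
    raise "CPI exited with #{status.exitstatus}: #{stderr}" unless status.success?
    stdout
  end
rescue Timeout::Error
  # Note: a real implementation would also need to terminate the spawned
  # CPI process, not just abandon it.
  raise "CPI call exceeded #{CPI_CALL_TIMEOUT}s; check the CPI logs for details"
end
```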

Describe alternatives you've considered
Timeouts can be implemented on the CPI side, but there's a fan-out issue since there are multiple CPIs. It also doesn't help with possible hangs in the director -> CPI communication itself.

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174783039

The labels on this github issue will be updated when the story is started.

Here's an instance https://cloudfoundry.slack.com/archives/C02HPPYQ2/p1573162453274000 where it looks to have hung while connecting to the IaaS. It does look like the vSphere CPI at least has some timeout functionality.

I think it is better to have CPIs with properly configured IaaS/HTTP clients, because the CPIs can better decide when to time out. For example, the OpenStack CPI's connection options are even configurable via the BOSH manifest: you can set HTTP (excon) options depending on what the OpenStack environment needs (see the sketch below). There are also situations where the CPIs retry, and that is why a call takes too long. A general safeguard on the BOSH side, which can terminate the operation if something takes too long because it is not properly configured or handled, is also a good idea. It is just difficult to find an optimal timeout, but a "very long" timeout is for sure better than no timeout.
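
For illustration only, this is roughly what CPI-side HTTP timeouts look like with excon, the Ruby HTTP client the OpenStack CPI's connection options feed into; the endpoint and values are made up, and the exact manifest properties that map to them depend on the CPI release:

```ruby
require 'excon'

# Illustrative excon client with explicit timeouts, similar in spirit to the
# connection options the OpenStack CPI exposes through the manifest.
connection = Excon.new(
  'https://keystone.example.com:5000',  # made-up endpoint
  connect_timeout: 30,  # seconds to establish the connection
  read_timeout:    60,  # seconds to wait for response data
  write_timeout:   60   # seconds to send the request
)

response = connection.get(path: '/v3')
puts response.status
```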

Another suggestion that Kevin brought up is having the CPIs log somewhere themselves, rather than waiting for BOSH to log the CPI's actions after the call finishes. It would make debugging what goes wrong easier. Even in the retry issue you linked above, no logs showed up until after the 5 hours had passed and the CPI gave up and returned, which is not an optimal operational experience.

Saving the CPI's logs somewhere, or consuming them differently, sounds to me like a prerequisite for this feature, because I'm not sure what happens if BOSH has to terminate the CPI process. I think in that case the process logs are gone. But this is a good point about the CPI's logs.

To wrap up, here is a more concrete form of what I think we've discussed above:

  1. Switch the director's reading of the CPI logs from a final (at exit of the binary) read to a streaming input - this should be a switch from Open3's capture3 to popen3. It also means we don't need to ask the CPIs to do anything differently (see the sketch after this list).

  2. Add a 3-hour timeout to the CPI calls. I've made up the number, but I can't think of a situation in which a user would want longer. Even in the AWS "We currently do not have sufficient m4.4xlarge capacity in the Availability Zone..." case, I would think it more valuable for operators to have the option of switching AZ or VM type. We could also make this configurable if there's pushback. If you have a better initial number to propose here, I'm happy to use that too.
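
A rough sketch of both items together, assuming the director currently shells out with something like Open3.capture3; the method name, the logger, the 64 KiB read size, and the 3-hour default are illustrative:

```ruby
require 'open3'

CPI_CALL_TIMEOUT = 3 * 60 * 60 # proposed default from item 2; could be made configurable

# Illustrative replacement for a capture3-based CPI call: popen3 lets the
# director stream the CPI's stderr (its log output) as it is produced, and a
# deadline check fails the task instead of letting it hang indefinitely.
def run_cpi_command(cmd, request_json, logger)
  deadline = Time.now + CPI_CALL_TIMEOUT

  Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.write(request_json)
    stdin.close

    response = +''
    readers = [stdout, stderr]
    until readers.empty?
      remaining = deadline - Time.now
      if remaining <= 0
        Process.kill('TERM', wait_thr.pid) rescue nil
        raise "CPI call exceeded #{CPI_CALL_TIMEOUT}s; see the CPI logs"
      end

      ready, = IO.select(readers, nil, nil, remaining)
      next if ready.nil? # select timed out; the next iteration raises above

      ready.each do |io|
        begin
          chunk = io.read_nonblock(64 * 1024)
          if io.equal?(stderr)
            logger.info(chunk) # stream CPI logs now, not after the call finishes
          else
            response << chunk  # the CPI's response arrives on stdout
          end
        rescue EOFError
          readers.delete(io)
        end
      end
    end

    status = wait_thr.value
    raise "CPI exited with status #{status.exitstatus}" unless status.success?
    response
  end
end
```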

I think it's also a given that we'll keep trying to push back on/fix CPIs that are failing for reasons like taking too long or failing to time out on calls. Hopefully the better logs will help us and the other CPI teams find solutions faster.

Sounds reasonable. I was thinking of a feature flag, but we can wait for pushback on this as well.