cloudfoundry-incubator/quarks-operator

When ig job partially fails, KubeCF installation is left broken


Sometimes, when the ig job fails, some of the instance groups are created but not all. This leaves KubeCF in a broken state that is hard to debug in production or demo systems, since the failure itself is not surfaced clearly.

It appears that the ig containers return a non-zero exit code, but the cf-operator still deletes them, leaving no obvious trace of the failure. This often leads to reports like "diego-cell is missing, KubeCF is broken."

Suggestion:

  1. Don't delete the ig job in case of failure.
  2. Make the creation of instance groups all or nothing: either all of them get created, or none of them do.
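For anyone hitting this today, a rough way to look for traces of the failure before the operator cleans up is to inspect jobs and failed pods directly. This is a sketch, not an official workflow; the `kubecf` namespace and the `quarks.cloudfoundry.org/qjob-name` label are assumptions based on a typical KubeCF deployment and may differ in yours:

```shell
# List jobs in the KubeCF namespace (ig jobs appear here while they exist)
kubectl get jobs -n kubecf

# Find pods that exited with a failure
kubectl get pods -n kubecf --field-selector=status.phase=Failed

# Dump logs from all containers of a failed ig pod
# (replace POD_NAME with a pod name from the previous command)
kubectl logs POD_NAME -n kubecf --all-containers

# Optionally filter by the QuarksJob label (label name is an assumption)
kubectl get pods -n kubecf -l quarks.cloudfoundry.org/qjob-name
```

If the operator has already deleted the job and its pods, these commands will show nothing, which is exactly the debuggability gap suggestion 1 is about.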

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/171448445

The labels on this github issue will be updated when the story is started.

@f0rmiga the ig job should stay around [1] - do you have a repro scenario?

Your second suggestion might not be ideal - it would mean that we can't start things as configuration becomes available. Everything would start all at once.

We're also working on improved statuses for a bunch of resources [2], that might help with this.

[1] https://github.com/cloudfoundry-incubator/quarks-job/blob/master/pkg/kube/controllers/quarksjob/job_reconciler.go#L89

[2] https://www.pivotaltracker.com/story/show/171854899

@viovanov I can no longer reproduce this. A change in the cf-operator may have fixed it. If it happens again I'll reopen with the repro steps.

The second suggestion is indeed not ideal.

Thanks for the links.