cloudfoundry-incubator/quarks-operator

When ig job partially fails, KubeCF installation is left broken


Sometimes, when the ig job fails, some of the instance groups are created but not all. This leaves KubeCF in a broken state that is hard to debug in production or demo systems, since the failure itself is not surfaced clearly.

It appears that the ig containers return a non-zero exit code, but the cf-operator still deletes them, leaving no obvious trace of the failure. This often leads to reports like "diego-cell is missing, KubeCF is broken."

Suggestion:

  1. Don't delete the ig job in case of failure.
  2. Make the creation of instance groups all or nothing: either all of them get created, or none of them do.
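For anyone hitting this today, a rough way to look for traces of the failure before the operator cleans up is to inspect jobs and failed pods directly. This is a sketch, not an official workflow; the `kubecf` namespace and the `quarks.cloudfoundry.org/qjob-name` label are assumptions based on a typical KubeCF deployment and may differ in yours:

```shell
# List jobs in the KubeCF namespace (ig jobs appear here while they exist)
kubectl get jobs -n kubecf

# Find pods that exited with a failure
kubectl get pods -n kubecf --field-selector=status.phase=Failed

# Dump logs from all containers of a failed ig pod
# (replace POD_NAME with a pod name from the previous command)
kubectl logs POD_NAME -n kubecf --all-containers

# Optionally filter by the QuarksJob label (label name is an assumption)
kubectl get pods -n kubecf -l quarks.cloudfoundry.org/qjob-name
```

If the operator has already deleted the job and its pods, these commands will show nothing, which is exactly the debuggability gap suggestion 1 is about.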

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/171448445

The labels on this github issue will be updated when the story is started.

@f0rmiga the ig job should stay around [1] - do you have a repro scenario?

Your second suggestion might not be ideal - it would mean that we can't start things as configuration becomes available. Everything would start all at once.

We're also working on improved statuses for a bunch of resources [2], that might help with this.

[1] https://github.com/cloudfoundry-incubator/quarks-job/blob/master/pkg/kube/controllers/quarksjob/job_reconciler.go#L89

[2] https://www.pivotaltracker.com/story/show/171854899

@viovanov I can no longer reproduce this. A change in the cf-operator may have fixed it. If it happens again I'll reopen with the repro steps.

The second suggestion is indeed not ideal.

Thanks for the links.