When ig job partially fails, KubeCF installation is left broken
Sometimes, when the `ig` job fails, some of the instance groups get created but not all. This leaves KubeCF in a broken state that is hard to debug in production or demo systems, because the failure is not obvious.
It appears that the `ig` containers return a non-zero exit code, but the cf-operator still wipes them out, leaving no obvious trace of the failure. This often leads to reports like "diego-cell is missing, KubeCF is broken".
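For anyone debugging this, a minimal way to see which instance groups actually came up, and to look for hints about the failed `ig` job, is to inspect the deployment namespace. The `kubecf` namespace below is an assumption; use whichever namespace the deployment was installed into.

```sh
# List the StatefulSets and pods the operator created for the deployment.
# Any instance group missing from this output (e.g. diego-cell) never got created.
kubectl get statefulsets,pods -n kubecf

# Check recent events in the namespace for hints about why the ig job failed.
kubectl get events -n kubecf --sort-by=.metadata.creationTimestamp
```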
Suggestions:
- Don't delete the `ig` job in case of failure, so its logs remain available (a log-capture workaround is sketched after this list).
- Make the creation of instance groups all or nothing: either all of them get created, or none of them do.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/171448445
The labels on this GitHub issue will be updated when the story is started.
@f0rmiga the `ig` job should stay around [1] - do you have a repro scenario?
Your second suggestion might not be ideal - it would mean that we couldn't start things as configuration becomes available; everything would have to start all at once.
We're also working on improved statuses for a bunch of resources [2], which might help with this.