golemfactory/ray-on-golem

Clear cluster state on head node fail

approxit opened this issue · 1 comments

As we get closer to supporting multiple clusters, our use of ray up / ray down / ray-on-golem start / ray-on-golem stop intensifies, and we are encountering new problems. When head node creation fails during a fresh ray up, the intuitive reaction is to call ray up again. The problem is that the webserver still holds the "corrupted" state from the failed attempt, so retrying ray up makes no progress. The user has to know that a manual call to ray-on-golem stop is required before proceeding. Let's address that.

Ray does not have a concept of a cluster the way we do, so we can tie our cluster's lifecycle to the fate of the head node: in Ray, the head node acts as the single central point of state.

If head node setup fails, the webserver needs to clean up the whole cluster state so that it is ready for the next ray up call.
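A minimal sketch of the intended behavior, not the actual ray-on-golem code: `ClusterState`, `HeadNodeSetupError`, and `bring_up_cluster` are hypothetical names, and the real webserver state is certainly richer than this. The point is only that any failure while creating the head node wipes the whole cluster state before the error is propagated, so a retry starts from scratch.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, Optional


class HeadNodeSetupError(Exception):
    """Hypothetical error raised when the head node cannot be created."""


@dataclass
class ClusterState:
    """Hypothetical stand-in for the state the webserver keeps for a cluster."""

    head_node_id: Optional[str] = None
    nodes: Dict[str, str] = field(default_factory=dict)

    def clear(self) -> None:
        # Drop everything tied to the cluster, not just the head node entry.
        self.head_node_id = None
        self.nodes.clear()


async def bring_up_cluster(
    state: ClusterState,
    create_head_node: Callable[[], Awaitable[str]],
) -> str:
    """Create the head node; on failure, reset the whole cluster state."""
    if state.head_node_id is not None:
        # A healthy head node already exists, nothing to do.
        return state.head_node_id

    # Provisional bookkeeping that would otherwise linger after a failure.
    state.nodes["head"] = "pending"
    try:
        node_id = await create_head_node()
    except Exception:
        # The head node is the single point of state: if it fails,
        # the cluster is gone, so leave nothing behind for the retry.
        state.clear()
        raise

    state.head_node_id = node_id
    state.nodes["head"] = "running"
    return node_id
```

With this shape, a second call after a failed one behaves exactly like a first call, which is what a user retrying ray up would expect, without having to run ray-on-golem stop in between.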

This requires refactoring the ray service and the golem service.
It is a valid UX issue.