Clear cluster state on head node fail
approxit opened this issue · 1 comments
As we get closer to supporting multiple clusters, our use of `ray up` / `ray down` / `ray-on-golem start` / `ray-on-golem stop` intensifies, and we are encountering new problems. When head node creation fails during a fresh `ray up`, the intuitive response is to call `ray up` again. The problem is that the webserver still holds the "corrupted" state from the failed attempt, so retrying `ray up` makes no progress. The user has to know that a manual call to `ray-on-golem stop` is required before proceeding. Let's address that.
Ray does not have a concept of a cluster the way we do, so we can tie our cluster's lifetime to the fate of the head node: in Ray, the head node acts as the central, single point of state.
If head node setup fails, the webserver needs to clean up the whole cluster state so that it is ready for the next `ray up` call.
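To make the proposed behavior concrete, here is a minimal sketch in Python. It is not the actual ray-on-golem code: `Cluster`, `HeadNodeSetupError`, `create_head_node`, and the `provision_head` callable are all hypothetical names used only for illustration. The idea is that any failure while setting up the head node wipes the cluster state before the error is reported, so a retry starts from scratch.

```python
from typing import Callable, Dict, Optional


class HeadNodeSetupError(Exception):
    """Hypothetical error raised when head node setup fails."""


class Cluster:
    """Hypothetical stand-in for the per-cluster state kept by the webserver."""

    def __init__(self, provision_head: Callable[[], str]) -> None:
        # provision_head is an assumed callable that creates the head node
        # and returns its id, or raises on failure.
        self._provision_head = provision_head
        self.nodes: Dict[str, str] = {}
        self.head_node_id: Optional[str] = None

    def clear(self) -> None:
        """Drop all state so the next attempt starts from a clean slate."""
        self.nodes.clear()
        self.head_node_id = None

    def create_head_node(self) -> str:
        # Record partial state first, mimicking what a failed setup leaves behind.
        self.nodes["head"] = "pending"
        try:
            node_id = self._provision_head()
        except Exception as exc:
            # The head node is the single point of state: if it fails,
            # the whole cluster state is discarded before re-raising.
            self.clear()
            raise HeadNodeSetupError("head node setup failed") from exc
        self.nodes["head"] = "running"
        self.head_node_id = node_id
        return node_id
```

With cleanup done inside the failure path, a second call to `create_head_node` (the analogue of re-running `ray up`) does not see leftovers from the first attempt, and no manual stop step is needed in between.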
This requires refactoring the ray service and the golem service.
It is a valid UX issue.