DedicatedGameServer status - what to do if failed?
dgkanatsios opened this issue · 3 comments
A DedicatedGameServer can signal to our API Server that it has Failed. Currently, the only thing we do is mark the entire DedicatedGameServerCollection as Failed. We should investigate whether we should do something else in this case:
- restart the DedicatedGameServer (simply delete it and one will be recreated by the DedicatedGameServerCollection Controller)
- have the DedicatedGameServerCollection Controller create an extra one, so the cluster administrator can investigate the faulty one's logs
- anything else?
We could have the user select what to do via an extra flag on the DedicatedGameServerCollection.
We should also investigate what to do if DedicatedGameServers and/or their corresponding Pods are created and then Failed. If we opt to create more DedicatedGameServers and/or Pods, maybe we should set a threshold of some kind, e.g. if 30% or more of the DedicatedGameServerCollection has failed (again, either from a DedicatedGameServer or a Pod perspective) we should stop creating more DedicatedGameServers/Pods.
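A minimal sketch of the threshold check described above, assuming the failed and total counts are already known. The function name `shouldStopCreating` and its parameters are hypothetical and not part of the project; the 30% figure is only the example value from this comment.

```go
package main

import "fmt"

// shouldStopCreating reports whether the share of failed members of a
// collection has reached thresholdPercent (hypothetical helper).
func shouldStopCreating(failed, total, thresholdPercent int) bool {
	if total <= 0 {
		// An empty collection has nothing that could have failed.
		return false
	}
	// Compare failed/total >= thresholdPercent/100 using integers only,
	// which avoids floating point rounding at the boundary.
	return failed*100 >= thresholdPercent*total
}

func main() {
	fmt.Println(shouldStopCreating(2, 10, 30)) // false: 20% failed
	fmt.Println(shouldStopCreating(3, 10, 30)) // true: 30% failed
}
```

The same check would apply whether failures are counted per DedicatedGameServer or per Pod; only the inputs change.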
We will proceed to create an enumeration on the DedicatedGameServerCollection object for what the behavior would be on subsequent failure of a DedicatedGameServer.
When a DGS fails, the two options would be to either delete it or remove it from the collection. Let's call this enumeration 'DGSFailBehavior', with two available options: 'Remove' and 'Delete'.
- If the 'DGSFailBehavior' is set to 'Remove', we should just remove the owner of the DGS and set the original DGSCol label (we can check the MarkedForDeletion code to see how this is done)
- If the 'DGSFailBehavior' is set to 'Delete', we should delete the DGS from the cluster
Default value should be 'Remove'.
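One way the enumeration and its default could look in Go, as a sketch only. The type name `DGSFailBehavior` and the values 'Remove' and 'Delete' come from the proposal above; the helper `failBehaviorOrDefault` and the choice of a string-backed type are assumptions.

```go
package main

import "fmt"

// DGSFailBehavior is the proposed enumeration, modelled here as a string type.
type DGSFailBehavior string

const (
	Remove DGSFailBehavior = "Remove"
	Delete DGSFailBehavior = "Delete"
)

// failBehaviorOrDefault (hypothetical helper) applies the proposed default:
// an unset value becomes Remove, and unknown values are rejected.
func failBehaviorOrDefault(b DGSFailBehavior) (DGSFailBehavior, error) {
	switch b {
	case "":
		return Remove, nil
	case Remove, Delete:
		return b, nil
	default:
		return "", fmt.Errorf("unknown DGSFailBehavior %q", b)
	}
}

func main() {
	b, err := failBehaviorOrDefault("")
	fmt.Println(b, err) // Remove <nil>
}
```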
We should also add a 'DGSMaxFailures' integer variable on the DedicatedGameServerCollection.
If the number of failures is bigger than this threshold, we should set the Collection to a 'Failed' state and not 'Delete' or 'Remove' any more DGS.
Default value for 'DGSMaxFailures' should be 0, i.e. if any DGS fails, set the Collection to an unhealthy state and take no action.
We will keep the number of failures for each DGSCollection in a variable called 'DGSTimesFailed'.
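The rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not the controller's actual code: the struct `collectionState`, the type `failAction`, and the function `onDGSFailed` are hypothetical, and it assumes the counter is incremented before it is compared with the threshold.

```go
package main

import "fmt"

// DGSFailBehavior mirrors the enumeration proposed in this thread.
type DGSFailBehavior string

const (
	Remove DGSFailBehavior = "Remove"
	Delete DGSFailBehavior = "Delete"
)

// collectionState is a hypothetical stand-in for the relevant fields of a
// DedicatedGameServerCollection.
type collectionState struct {
	DGSFailBehavior DGSFailBehavior
	DGSMaxFailures  int
	DGSTimesFailed  int
	Failed          bool
}

// failAction (hypothetical) is what the controller should do with the failed DGS.
type failAction string

const (
	actionNone   failAction = "None"
	actionRemove failAction = "Remove"
	actionDelete failAction = "Delete"
)

// onDGSFailed records one failure and decides what to do with the failed DGS.
func onDGSFailed(c *collectionState) failAction {
	c.DGSTimesFailed++
	if c.DGSTimesFailed > c.DGSMaxFailures {
		// Threshold exceeded: mark the collection Failed and take no action.
		c.Failed = true
		return actionNone
	}
	if c.DGSFailBehavior == Delete {
		return actionDelete
	}
	// Remove is the default behavior, so it also covers an unset value.
	return actionRemove
}

func main() {
	c := &collectionState{DGSFailBehavior: Delete, DGSMaxFailures: 1}
	fmt.Println(onDGSFailed(c), c.Failed) // Delete false
	fmt.Println(onDGSFailed(c), c.Failed) // None true
}
```

With the default 'DGSMaxFailures' of 0, the first failure already exceeds the threshold, so the collection is marked Failed and nothing is deleted or removed.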
Documentation is here