ray-project/ray

[core] Support decommissioning (graceful shutdown) of worker nodes


Description

Hi Team,
I would like to request a feature to decommission a worker node, i.e. shut a worker node down gracefully, by:
1, stop accepting new task requests and wait for all running tasks to finish (with a timeout)
2, stop all actor and worker processes
3, migrate all object data to other nodes
4, shut down the raylet (a rough sketch of this sequence is shown below)
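To make the requested ordering and timeout behavior concrete, here is a rough Python sketch from the caller's side. None of the helpers below are existing Ray APIs; they are hypothetical placeholders for the four steps above.

```python
import time

# Hypothetical sketch of the requested decommission sequence.
# None of these helpers exist in Ray today; they only illustrate
# the ordering and timeout semantics described above.

def stop_scheduling_new_tasks(node_id: str) -> None:
    """Step 1a: stop placing new tasks on the node."""
    print(f"[{node_id}] no longer accepting new tasks")

def _running_tasks(node_id: str) -> list:
    return []  # placeholder: pretend the node has already drained

def wait_for_running_tasks(node_id: str, timeout_s: float) -> bool:
    """Step 1b: wait for in-flight tasks to finish, up to a timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not _running_tasks(node_id):
            return True
        time.sleep(1)
    return False

def stop_actors_and_workers(node_id: str) -> None:
    """Step 2: stop all actor and worker processes on the node."""
    print(f"[{node_id}] actors and workers stopped")

def migrate_objects(node_id: str) -> None:
    """Step 3: copy object store data to other nodes."""
    print(f"[{node_id}] objects migrated")

def shutdown_raylet(node_id: str) -> None:
    """Step 4: finally shut down the raylet itself."""
    print(f"[{node_id}] raylet shut down")

def decommission_node(node_id: str, task_drain_timeout_s: float = 300.0) -> None:
    stop_scheduling_new_tasks(node_id)
    if not wait_for_running_tasks(node_id, task_drain_timeout_s):
        print(f"[{node_id}] drain timed out; remaining tasks are retried elsewhere")
    stop_actors_and_workers(node_id)
    migrate_objects(node_id)
    shutdown_raylet(node_id)

if __name__ == "__main__":
    decommission_node("worker-node-1")
```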

My initial motivation is Ray Data, though it may also be useful for other Ray libraries. It would help Ray Data in at least the following cases:
1, because Ray Data doesn't support Hadoop FileSplit like Spark/Hive, Ray Data can only use 1 task to read 1 file; if a downstream task fails, the whole file must be re-read
2, if a Ray Data operation materializes data to the object store and the worker then fails, all upstream tasks need to be rerun, which can be a huge amount of work

Other data processing frameworks like Spark/Hive solve this problem by:
1, both Spark and Hive have an external shuffle service, which runs outside the executors; if a Spark/Hive node fails, the data in the external shuffle service is still there and can keep serving downstream tasks
2, Spark also supports decommissioning via the config spark.decommission.enabled (a rough example follows this list), see https://spark.apache.org/docs/latest/configuration.html
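As a rough illustration (not part of the original request), enabling Spark's decommissioning from PySpark can look like the snippet below; the exact set of related storage-decommission options varies by Spark version, so the keys should be checked against the configuration page linked above.

```python
from pyspark.sql import SparkSession

# Approximate example: turn on Spark executor decommissioning.
# spark.decommission.enabled is the flag referenced above; the storage
# decommission options below exist in recent Spark releases but should
# be verified against the linked configuration documentation.
spark = (
    SparkSession.builder
    .appName("decommission-example")
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .getOrCreate()
)
```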

I heard from Sam (Ray Team) that the Anyscale version already supports this feature; could you please open source it for the community?

Thanks!

Use case

1, In a cluster where resources are not stable, for example a K8s cluster where online services and offline Ray inference jobs are deployed together, the online services may ask for more resources at peak times and K8s will then kill workers of the offline inference job; in this case we want to reduce the impact on the offline job by allowing the worker to shut down gracefully (a sketch of reacting to such a termination signal follows after this list)
2, If the autoscaler is enabled, it may not be able to recycle a worker while there is object data on that worker; with this feature, it could decommission the worker and then recycle it
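For use case 1, here is a minimal sketch (my assumption of how it could be wired up, not an existing Ray mechanism) of a node-side process reacting to K8s sending SIGTERM within the pod's termination grace period by draining instead of exiting immediately:

```python
import signal
import sys
import time

# Minimal sketch: handle K8s termination (SIGTERM) by draining the node
# instead of exiting immediately. decommission_node is the hypothetical
# helper from the earlier sketch, stubbed here so the example runs on its own.

def decommission_node(node_id: str, task_drain_timeout_s: float) -> None:
    print(f"[{node_id}] draining (timeout {task_drain_timeout_s}s)")
    time.sleep(1)  # placeholder for the real drain sequence

def handle_sigterm(signum, frame):
    print("SIGTERM received; starting graceful decommission")
    decommission_node("this-node", task_drain_timeout_s=120.0)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
signal.pause()  # keep the process alive until a signal arrives
```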