dask/dask-kubernetes

One to one deployment to worker relationship

Closed this issue · 6 comments

Describe the issue:
It looks like dask-kubernetes creates one Deployment per worker. Is this intentional? Would it make more sense to have one Deployment per worker group (i.e. a one-to-many relationship between Deployment and workers), since Deployments have a replicas field that propagates to the underlying ReplicaSet and ultimately determines how many individual Pods there are? I'm using custom_cluster_spec, so it's possible this behavior is specific to that.

Minimal Complete Verifiable Example:

from dask_kubernetes.operator import KubeCluster

config = DaskClusterConfig()  # not defined here; presumably the reporter's own config helper
c = KubeCluster(custom_cluster_spec=<foo>)  # <foo> stands in for the custom cluster spec
c.scale(4)

# then, in a shell:
#   kubectl get deployments

There will be 5 deployments, 1 for the scheduler and 1 for each worker.
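For reference, a minimal sketch of building the spec with the make_cluster_spec helper that ships with the operator, rather than hand-writing one (the cluster name below is hypothetical):

from dask_kubernetes.operator import KubeCluster, make_cluster_spec

# Build a DaskCluster spec instead of passing a hand-written dict
spec = make_cluster_spec(name="example", n_workers=1)  # "example" is a hypothetical name
cluster = KubeCluster(custom_cluster_spec=spec)
cluster.scale(4)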

Anything else we need to know?:

Environment:

  • Dask version: 2023.9.2
  • Dask Kubernetes version: 2023.9.0
  • Python version: 3.9.17
  • Operating System: macOS 13.6
  • Install method (conda, pip, source): pip

Hey @droctothorpe!

The one-to-one relationship is actually intentional! The reason is that Dask maintains its own (more complex) logic around which workers to remove on a scale-down event. Those scheduling decisions don't just kill a random worker (as would happen with a Kubernetes scale-down); the scheduler has more information available, like how much work a worker still has queued and what data it holds.

(hope I'm not saying anything wrong here @jacobtomlinson)

Yep, exactly as @bstadlbauer says. Deployments aren't intelligent enough to handle scaling workers given the semi-stateful nature of Dask workers. We used to create the Pods directly, but there are some benefits to wrapping each one in a Deployment, e.g. the Pod gets recreated automatically if it is evicted from a node.
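To make the contrast concrete, here's a rough sketch of the targeted retirement the scheduler supports (the addresses below are hypothetical; a plain Deployment scale-down has no equivalent hook):

from dask.distributed import Client

client = Client("tcp://scheduler.example:8786")  # hypothetical scheduler address

# Gracefully retire a specific worker: its data is replicated to other
# workers before it shuts down, instead of an arbitrary Pod being killed.
client.retire_workers(workers=["tcp://10.0.0.12:4021"])  # hypothetical worker address

# When no workers are passed, the scheduler's own workers_to_close logic
# picks the cheapest workers to remove.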

Thanks for sharing the rationale, guys. Not sure it would help, but a StatefulSet might be useful in this context.

One small concern with one Deployment per worker is the sheer volume of Kubernetes resources. It's not unusual for our users to spin up clusters with hundreds of workers. 800 workers (we have users operating at this scale) means 800 Deployments, 800 ReplicaSets, and 800 Pods. That's 2400 resources for just one cluster!
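As a rough way to see that multiplier, one could count the objects a single cluster owns; this sketch assumes the operator labels its resources with dask.org/cluster-name, and the cluster name is hypothetical:

from kubernetes import client, config

config.load_kube_config()
selector = "dask.org/cluster-name=example"  # hypothetical cluster name

apps = client.AppsV1Api()
core = client.CoreV1Api()

n_deployments = len(apps.list_deployment_for_all_namespaces(label_selector=selector).items)
n_replicasets = len(apps.list_replica_set_for_all_namespaces(label_selector=selector).items)
n_pods = len(core.list_pod_for_all_namespaces(label_selector=selector).items)

# With one Deployment per worker, each count tracks the worker count,
# so 800 workers is on the order of 2400 objects for one cluster.
print(n_deployments, n_replicasets, n_pods)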

I think we would have the same problem with a StatefulSet, because the Dask scheduler may want to remove arbitrary workers, and StatefulSets do ordered scaling (they always remove the highest-ordinal Pod on scale-down).

I hear your concern. Are you seeing control plane problems with that number of resources? In theory Kubernetes can support up to 150k pods, so I would expect it to handle users like this just fine. I wonder if directly using a ReplicaSet would still give us the benefits we need but would reduce the number of resources by a third.

Haven't seen issues yet and don't want to optimize prematurely. Eliminating deployments if they're not actually necessary SGTM, FWIW.

I think I'm tempted to say let's close this out as premature optimization, as you say. But if folks start running into problems we can explore this topic again in the future.