spotify/flink-on-k8s-operator

Taskmanager ephemeral Deployment instead of Statefulset

live-wire opened this issue · 15 comments

We performed some experiments to test FlinkCluster creation time for jobs with a lot of replicas.

Experiments were performed on GKE k8s version 1.21.11-gke.900
TaskManager details: 600 replicas

  • cpu: 12
  • mem: 96Gi
  • storage: 600Gi

For my experiments, I used the following definitions:


Metric: Startup time = All pods ready

Available options:

  1. Already available Statefulsets with PVCs.

    • Startup time = 30+ minutes
    • Pros:
      • Uses PVCs so we can specify storage request per pod and one volume is only mounted on one pod at a time.
      • Since state is saved per indexed TaskManager, recovery from a lost pod (TM) is possible.
    • Cons:
  2. Deployments with generic ephemeral volumes

    • Startup time = <6 minutes
    • Pros:
      • Uses PVCs internally, so we can specify storage request per pod and one volume is only mounted on one pod at a time.
      • Fast. ReplicaSet pods are spawned in parallel, so startup stays quick even for a large number of replicas with attached volumes.
    • Cons:
      • Because of the ephemeral nature of the storage, if a pod is lost, so is any state associated with its volume.
  3. Deployments with emptyDir

    • Startup time = 70 seconds
    • Pros: very fast startup
    • Cons:
      • Uses node boot disk and can't provide storage requests per pod
      • Pods share the same disk on the node and there is no separation

It would be worthwhile to support ephemeral Deployments instead of StatefulSets for TaskManagers in the operator (at least for batch workloads).

I'm quite surprised by these numbers, especially Startup time = 30+ minutes for StatefulSets. I've been working lately with StatefulSets + PVCs and Deployments + ephemeral storage, and the numbers are relatively on par, 3 to 6 min, since there's waiting time for PV provisioning. My use case uses around 400 pods.

Are you sure no new node was being provisioned when you measured that time? A difference that large suggests it happened.

If no new node was being provisioned, could smth else be affecting the deployment, like the volcano scheduler?

Yes I was surprised by the difference too since they both use PVCs.
I ensured that there was no scaling taking place in the cluster at the moment of the experiments.

I ensured the volcano scheduler was being used by both by adding schedulerName: volcano to both specs and checking the pod definitions post creation too. Also, a podGroup was created for both deployment/statefulset.

Another observation: the StatefulSet pods were all being spawned sequentially (no more than 5ish pods in the pending/creating state at a time). So one hypothesis is that even if the cluster needs a node scale-up, the autoscaler sees only a few unschedulable pods at a time and provisions only a few machines. With Deployments, hundreds of pods are created simultaneously and the autoscaler can provision multiple nodes without so much back-and-forth.

Seems like Statefulset podManagementPolicy: Parallel is not parallel enough 😢

That's really weird; smth is definitely not right.

> Yes I was surprised by the difference too since they both use PVCs.
> I ensured that there was no scaling taking place in the cluster at the moment of the experiments.
>
> I ensured the volcano scheduler was being used by both by adding schedulerName: volcano to both specs and checking the pod definitions post creation too. Also, a podGroup was created for both deployment/statefulset.

I would do the same test without volcano and target a specific nodepool with a fixed number of nodes to make sure the autoscaler doesn't kick in.

> Another observation: the StatefulSet pods were all being spawned sequentially (no more than 5ish pods in the pending/creating state at a time).

podManagementPolicy: Parallel is parallel, and if the conditions are right you should see far more than that number of pods in the creating state.

I just tested this again and the results are pretty consistent: there's not a big difference in time between them. Since StatefulSets serve both streaming and batch use cases, I would rather just stick with StatefulSets and not increase complexity in the operator.

FYI you can also use ephemeral volumes with statefulsets.

Elastic Nodepool:

Statefulset with the same config as above:
With Volcano:

  • creationTimestamp: "2022-05-30T16:15:00Z"
  • lastTransitionTime: "2022-05-30T16:44:45Z"
  • reason: tasks in gang are ready to be scheduled
  • Startup time = 30 mins

Without Volcano:

  • Startup time = 26 mins
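
The 30 min figure for the Volcano run is just the gap between the two timestamps reported above. A small sketch of that subtraction, assuming both values are UTC timestamps in the usual Kubernetes format; the helper names are made up for illustration:

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_timestamp(ts: str) -> datetime:
    """Parse a timestamp such as "2022-05-30T16:15:00Z" as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def startup_time(created: str, ready: str) -> timedelta:
    """Elapsed time between resource creation and the ready transition."""
    return parse_k8s_timestamp(ready) - parse_k8s_timestamp(created)


if __name__ == "__main__":
    # creationTimestamp and lastTransitionTime from the Volcano run.
    elapsed = startup_time("2022-05-30T16:15:00Z", "2022-05-30T16:44:45Z")
    print(elapsed)  # 0:29:45
```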

> podManagementPolicy: Parallel is parallel, and if the conditions are right you should see far more than that number of pods in the creating state.

Indeed, I could see many more pods in the Creating state at a time, but the startup time was still disappointing.

> FYI you can also use ephemeral volumes with statefulsets.

Yes, since it is part of the pod spec. 👍

Static Nodepool results to follow.

jto commented

@regadas Could you share the resource definitions you used? Something has to be different between your tests and @live-wire's.
Which version of Kubernetes were you using?

jto commented

> podManagementPolicy: Parallel is parallel

It's not 100% parallel. The pods are still created sequentially. It will not wait for a pod to be ready to create the next one, but it will still create pods one by one.
If for some reason it takes 2s to create a pod, with 600 replicas, it will take 20 minutes to create all the pods and a bit longer for all of them to be ready.
I'm not too sure why it would take 2s to create a single pod. Maybe because of the admission controllers.
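
The back-of-the-envelope estimate above can be written out as a tiny sketch. The 2 s per pod is the hypothetical figure from this comment, not a measured value, and the function name is made up for illustration:

```python
def sequential_creation_seconds(replicas: int, seconds_per_pod: float) -> float:
    """Total time when pods are created strictly one after another."""
    return replicas * seconds_per_pod


if __name__ == "__main__":
    total = sequential_creation_seconds(replicas=600, seconds_per_pod=2.0)
    print(f"{total:.0f} s = {total / 60:.0f} min")  # 1200 s = 20 min
```

This only covers pod creation; time for the pods to become ready comes on top of it.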

Hey yeah! I can definitely share; let me prep that and adapt it to Flink.

> Which version of Kubernetes were you using?

1.21.11-gke.1100

> The pods are still created sequentially.

This gets me thinking; the fact that you are seeing this with the policy set to Parallel suggests to me that smth else is at play. Sequential creation should only happen if OrderedReady is used.

> Maybe because of the admission controllers.

Yeah maybe 🤷

jto commented

> This gets me thinking; the fact that you are seeing this with the policy set to Parallel suggests to me that smth else is at play. Sequential creation should only happen if OrderedReady is used.

No, that's just how it works, sadly. StatefulSets are implemented as a custom controller. You can see the implementation here: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/stateful_set_control.go#L396=

If there's something "slow" in this loop (creating the PVC or smth else), it will add up with each pod creation. It would be great to have more visibility into what's slowing down the process, but I'm not sure we have that.

> No, that's just how it works, sadly. StatefulSets are implemented as a custom controller. You can see the implementation here: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/stateful_set_control.go#L396=

Ahah this is great! 👍 Yeah, I see there's potential to take more time when sending the request to create the PVC. I'm interested to see whether that's also the case for Deployments; I need to have a look.

Anyway, what I used for the test is here: https://github.com/regadas/shenanigans

❯ kubectl -n flink-dev get statefulsets/flink-taskmanager -o json | jq .metadata.creationTimestamp
"2022-05-31T09:22:13Z"
❯ kubectl -n flink-dev get pods -o json | jq '[ .items[].status.conditions[] | select(.type == "Ready").lastTransitionTime ] | max'
"2022-05-31T09:26:47Z"

Well, Deployments are slightly different (https://github.com/kubernetes/kubernetes/blob/aae07c6f7bb715750a2d0cc77cf81c4e31a4904b/pkg/controller/replicaset/replica_set.go#L759), but they do also wait for the batch to complete. So could a Deployment + generic ephemeral volumes see issues as well?

The more I think about and look at this, the more I feel that a few other things can influence it, and I don't think it will be easy to replicate. The truth is that Deployments have the potential to be (and are) faster no matter what, since they are simpler and have less overhead than StatefulSets.

That said, I think it's worth having a branch with the Deployment, giving it a go, and seeing how it behaves before rolling it out. I just wanted to understand the reason behind such a big difference before adding more to the operator.

jto commented

Deployments are different indeed. If I understand things correctly, the pods are created in batch in goroutines so they are not affected in the same way by delays in the pod creation.

Indeed. Deployments increase the batch size to 2^x each cycle (x), thereby speeding up as the batch size increases, while the StatefulSet pod creation batch size remains fixed at 1 :(
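
To get a feel for how much the doubling helps, here is an idealized sketch that counts creation rounds for 600 replicas. It assumes the first batch has size 1 and every batch succeeds, and it ignores any per-sync caps or error handling in the real controller; the function name is made up for illustration:

```python
from typing import List


def slow_start_batches(replicas: int, initial_batch_size: int = 1) -> List[int]:
    """Batch sizes when each round doubles the previous batch size."""
    batches: List[int] = []
    remaining = replicas
    batch = initial_batch_size
    while remaining > 0:
        size = min(batch, remaining)  # the last batch may be partial
        batches.append(size)
        remaining -= size
        batch *= 2
    return batches


if __name__ == "__main__":
    batches = slow_start_batches(600)
    print(len(batches), batches)
    # 10 [1, 2, 4, 8, 16, 32, 64, 128, 256, 89]
```

Under these assumptions a Deployment needs 10 rounds of parallel creates for 600 pods, whereas a fixed batch size of 1 needs 600 rounds, so any per-create delay is multiplied far less often.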


For my experiments above, I used the following definitions:


I tried deploying your definitions from https://github.com/regadas/shenanigans with
cpu: 12, mem: 96Gi, storage: 600Gi (for parity)

STATEFULSET

  • Elastic nodepool startup time = 25 mins
  • Static nodepool startup time = 23 mins

EPHEMERAL DEPLOYMENT

  • Elastic nodepool startup time = 5 mins 20 seconds
  • Static nodepool startup time = < 5 mins

This could be because the statefulset controller in the cluster I'm testing on is busier than the one in your cluster. But this will always be the case in a production setting. Also, we would have elastic nodepools + volcano-enabled in a prod setting making startup times worse.

Just out of curiosity, what do the numbers look like when using StatefulSets with ephemeral volumes?