Yolean/kubernetes-kafka

Use of pzoo and zoo as persistent/ephemeral storage nodes

Closed this issue · 10 comments

Hi,

In Zookeeper we have the notion of persistent/ephemeral nodes, but I'm struggling to understand why these concepts have been used here in terms of persistent volumes in K8s.

Can someone elaborate a bit further on what the objectives are for this intentional configuration?

Thanks.

It was introduced in #34 and discussed in #26 (comment).

The case for this has weakened, though, with increased support for dynamic volume provisioning across different Kubernetes setups, and with this setup being used for heavier workloads. I'd prefer if the two statefulsets could simply be scaled up and down individually. For example, if you're in a single zone you don't have the volume portability issue. In a setup like #118 with local volumes, however, it's quite difficult to ensure quorum capabilities on single node failure.

Unfortunately Zookeeper's configuration is static prior to 3.5, which is in development. Adapting to initial scale would be doable, I think. For example, the init script could use the Kubernetes API to read the desired number of replicas for both StatefulSets and generate the server.X strings accordingly.
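A minimal sketch of what such an init-script fragment could look like. The statefulset names (`pzoo`, `zoo`), namespace and ports follow this repo's conventions, but the generation logic itself is hypothetical; the replica counts would come from a kubectl jsonpath lookup in the real script:

```shell
# Hypothetical: generate zoo.cfg server.X entries from the desired replica
# counts of both statefulsets. In the init script the counts would come from
# the Kubernetes API, e.g.:
#   kubectl -n kafka get statefulset pzoo -o=jsonpath='{.spec.replicas}'
gen_servers() {
  pzoo_replicas=$1
  zoo_replicas=$2
  id=1
  i=0
  while [ "$i" -lt "$pzoo_replicas" ]; do
    echo "server.$id=pzoo-$i.pzoo:2888:3888"
    id=$(( id + 1 )); i=$(( i + 1 ))
  done
  i=0
  while [ "$i" -lt "$zoo_replicas" ]; do
    echo "server.$id=zoo-$i.zoo:2888:3888"
    id=$(( id + 1 )); i=$(( i + 1 ))
  done
}

# With the current defaults (3 persistent + 2 ephemeral):
gen_servers 3 2
# server.1=pzoo-0.pzoo:2888:3888
# server.2=pzoo-1.pzoo:2888:3888
# server.3=pzoo-2.pzoo:2888:3888
# server.4=zoo-0.zoo:2888:3888
# server.5=zoo-1.zoo:2888:3888
```

The output would be appended to the generated zoo.cfg, so scaling either statefulset before first start is just a matter of changing the two counts.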

Hi @solsson. I have the same confusion. I've read your comment on the other issue, but it didn't help me understand why two nodes use emptyDir rather than persistent volumes. Could you elaborate a little more on the scenarios where this is useful? How does it compare to using persistent volumes for all 5 nodes? I'm running my Kubernetes cluster on AWS, with 6 worker nodes spread across 3 availability zones. Thanks.

Good that you question this. The complexity should be removed if it can't be motivated. I'm certainly prepared to switch to all-persistent Zookeeper.

The design goal was to make the persistent layer as robust as the services layer. Probably not as robust as bucket stores or 3rd party hosted databases, but same uptime as your frontend is good enough.

Thus workloads will have to migrate in the face of lost availability zones, like non-stateful apps will certainly do with Kubernetes. I recall https://medium.com/spire-labs/mitigating-an-aws-instance-failure-with-the-magic-of-kubernetes-128a44d44c14 "a sense of awe watching the automatic mitigation".

Unless you have a volume type that can migrate, the problem is that stateful pods will only start in the zone where the volume was provisioned. With both 5 and 7 node zk across 3 zones, if a zone with 2 or 3 zk pods respectively goes out, you're -1 pod away from losing a majority of your zk. My assumption is that a lost majority means your service goes down. Zone outage can be extensive, as in the AWS case above, and due to zk's static configuration you can't reconfigure to adapt to the situation, as that would cause the -1.

With kafka brokers you can throw money at the problem: increase your replication factor. With zk you can't. Or maybe you can, with scale=9?
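The zone-failure arithmetic above can be sketched as a small calculation, assuming an even spread of the ensemble over 3 zones (so a lost zone takes out the largest zone's ceil(total/3) pods):

```shell
# Spare failures left after losing the largest of 3 zones, for an
# n-node ZooKeeper ensemble. Majority quorum is n/2+1.
quorum_margin() {
  total=$1
  majority=$(( total / 2 + 1 ))
  zone_loss=$(( (total + 2) / 3 ))   # pods in the largest zone
  survivors=$(( total - zone_loss ))
  echo "$total nodes: majority=$majority survivors=$survivors spare=$(( survivors - majority ))"
}

quorum_margin 5   # spare=0: one more pod failure loses quorum
quorum_margin 7   # spare=0: same situation
quorum_margin 9   # spare=1: can lose one more pod after a zone outage
```

So scale=9 would indeed buy one extra pod failure on top of a zone outage, at the cost of a much larger ensemble.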

@solsson I've tried to rephrase the reason for having pzoo and zoo below. Let me know what you think:

AFAICT, there are at least two types of failures for which there should be some protection.

  • Software errors: This is where something goes wrong with a Zookeeper pod that results in it going down. There is nothing wrong with the underlying infrastructure.

  • Infra errors: Underlying AWS/cloud infrastructure went down.

If there are 3 AZs, the 5 ZK pods are spread across these 3 AZs. If an AZ goes down, there is little benefit to having 5 ZK pods, since the AZ that went down could take 2 ZK pods with it. The ZK cluster is then 1 more failure away from being unavailable. The situation would be the same with only 3 ZK pods and 1 lost AZ.

However, for software errors, each pod could go down by itself and having 5 ZK nodes helps because it can tolerate 2 individual pod failures (instead of 1 in the 3ZK case).

While having only 3 EBS volumes instead of 5 does keep costs low, to avoid confusion, it would be better to have a single statefulset of pzoo with 5 nodes.

@shrinandj I think I agree at this stage. What would be even better, in particular now (unlike in the k8s 1.2 days) that support for automatic volume provisioning can be expected, would be to support scaling of the zookeeper statefulset(s). That way everyone can decide for themselves, and we can default to 5 persistent pods. Should be quite doable in the initscript, by retrieving the desired number of replicas with kubectl. I'd be happy to accept PRs for such things.

Can you elaborate a bit on that?

  • The default will be a statefulset with 5 pods.
  • Users can scale this up if needed by simply increasing the number from 5 to whatever using kubectl scale statefulsets pzoo --replicas=<new-replicas>. This should create the new PVCs and then run the pods.

What changes are required in the init script?

Sounds like a good summary, and my ideas for how are sketchy at best. Sadly(?) this repo has come of age already and needs to consider backwards compatibility. Hence we might want a multi-step solution:

  1. Add volume claims to the zoo statefulset, keep the init script as is.
  2. Add an ezoo (ephemeral) statefulset as a copy of the "old" zoo, for the multi-zone frugal use case, but with replicas=0.
  3. Include the above kubernetes-kafka release.
  4. Add a branch (for evaluation by those who dare) that generates the server entries based on kubectl -n kafka get statefulset zoo -o=jsonpath='{.status.replicas}' (and equivalent for pzoo - deprecated - and ezoo).
  5. If this is looking good, change defaults to replicas=5 for zoo and replicas=0 for pzoo+ezoo, with a documented migration procedure in release notes.

@solsson I understand that the steps mentioned above are needed for backwards compatibility, but if I want 5 pzoos I just need to change the replicas to 5 and remove the zoo statefulset, right?

@AndresPineros You'll also need to change the server.4 and server.5 lines in 10zookeeper-config.yml and prepend the p.
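For illustration, the affected entries would change roughly like this (the exact host/port layout in 10zookeeper-config.yml is assumed here, not quoted from the repo):

```
# before: last two ensemble members point at the ephemeral zoo statefulset
server.4=zoo-0.zoo:2888:3888
server.5=zoo-1.zoo:2888:3888

# after scaling pzoo to 5 and removing zoo
server.4=pzoo-3.pzoo:2888:3888
server.5=pzoo-4.pzoo:2888:3888
```

The server IDs stay the same; only the pod and service names change so the new entries resolve to the persistent statefulset's pods.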

See #191 (comment) for the suggested way forward.