High availability deployment for Istio
Bug Description
Is there a way we can deploy Istio in a high availability setup for a single cluster?
Given just one Kubeflow cluster, does it make sense to run Istio as a DaemonSet, as proposed by @kimwnasptd?
I went through [0], and its model of high availability usually refers to multiple clusters with multiple Istio control planes, so that, for example, the failure of a single control plane can be tolerated.
However, [0] does not really say anything about availability in the context of a single cluster, which contains a single Istio control plane.
[0] https://istio.io/latest/docs/ops/deployment/deployment-models/
To Reproduce
N/A
Environment
N/A
Relevant Log Output
N/A
Additional Context
N/A
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5502.
This message was autogenerated
The Istio docs above describe a combination of situations (single/multi cluster and single/multi network). From an initial look, though, I can't tell how they suggest we configure each case.
I also see that at some point they describe running Istio's control plane in a separate cluster, but this would need a bit of investigation.
We'll try to tackle this in steps, as discussed with @ca-scribner.
The first step will be to ensure that the IngressGateway Pods have HA. This will ensure that if a Pod that handles the Istio Gateway CR is down, the rest of Kubeflow can still be accessed.
The first approach we discussed was to:
- Keep having our ingressgateway charm create a Deployment
- In that Deployment, use affinity.podAntiAffinity (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#more-practical-use-cases)
With the above we can increase the number of replicas and ensure that the Pods do not get scheduled on the same node.
(The extreme of this would be to convert the Deployment to a DaemonSet, which would create a Pod for every node)
I tried configuring the Istio IngressGateway Deployment like this (in an upstream KF), and indeed it worked as expected:

    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - istio-ingressgateway
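
For anyone reproducing this: a required anti-affinity term also needs a `topologyKey`, which the excerpt above does not show. Below is a minimal sketch of how the fragment could sit inside the Deployment. It assumes `kubernetes.io/hostname` as the topology key (so "different topology domain" means "different node"), assumes the gateway Pods carry the `app: istio-ingressgateway` label, and places the affinity under `spec.template.spec`; none of these are confirmed by the configuration actually used here.

```yaml
# Sketch only: a partial Deployment spec, not a complete manifest.
spec:
  replicas: 2                     # assumption: one replica per schedulable node
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: the scheduler refuses to place a Pod in a topology
          # domain that already runs a Pod matching the selector.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - istio-ingressgateway
            # Assumed topology key: one domain per node.
            topologyKey: kubernetes.io/hostname
```

With a hard rule like this, the number of replicas that can run is capped by the number of nodes the Pods are allowed to schedule on.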
In a cluster with 2 schedulable nodes, when I set the replicas to 2 the Pods did get scheduled on different nodes.
Then, when I increased the replicas to 3, the third Pod stayed unschedulable with the expected error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m39s (x3 over 12m) default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod..
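
If leaving an extra replica Pending is not acceptable, one option that might be worth evaluating is the soft form of the same rule, `preferredDuringSchedulingIgnoredDuringExecution`. This was not tested in this thread; the sketch below reuses the same assumptions as before (label `app: istio-ingressgateway`, topology key `kubernetes.io/hostname`), and the weight of 100 is an arbitrary illustrative value.

```yaml
# Sketch only: soft anti-affinity, untested in this thread.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: the scheduler tries to spread the Pods across nodes,
          # but may still co-locate them when no other node fits.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100           # illustrative; valid range is 1-100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: istio-ingressgateway
              topologyKey: kubernetes.io/hostname
```

The trade-off is that the soft rule no longer guarantees that two replicas land on different nodes, so it weakens the HA property the hard rule provides.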