Need anti-affinity policies for replica pods.
Describe the problem/challenge you have
Distributed applications like mongodb require the volumes to be spread across multiple nodes - just like its own replicas. Cross scheduling them will cause performance and high availability issues.
Consider this case of 3 replica mongo sts. The mongo pods are neatly distributed across three different nodes:
kiran_mova_mayadata_io@kmova-dev:mongodb$ kubectl get pods -o wide | grep mongo
mongo-0 2/2 Running 0 56m 10.0.2.15 gke-kmova-helm-default-pool-30f2c6c6-1942 <none> <none>
mongo-1 2/2 Running 0 55m 10.0.0.21 gke-kmova-helm-default-pool-30f2c6c6-3jsv <none> <none>
mongo-2 2/2 Running 0 54m 10.0.1.12 gke-kmova-helm-default-pool-30f2c6c6-qf2w <none> <none>
However, the target pods are all packed onto a single node:
kiran_mova_mayadata_io@kmova-dev:mongodb$ kubectl get pods -o wide -n openebs | grep jiva-ctrl
pvc-1b21ac95-fd9f-466f-a39b-c1e1ab6e6cb5-jiva-ctrl-75d9f46fvxng 1/1 Running 0 58m 10.0.0.22 gke-kmova-helm-default-pool-30f2c6c6-3jsv <none> <none>
pvc-96120cb1-0f36-4a53-9263-6af8b8cc5a66-jiva-ctrl-6c5db7d7hq6n 1/1 Running 0 59m 10.0.0.17 gke-kmova-helm-default-pool-30f2c6c6-3jsv <none> <none>
pvc-faa218d5-46c6-4bb7-a598-024970cf9b4c-jiva-ctrl-548585cnz9js 1/1 Running 0 59m 10.0.0.20 gke-kmova-helm-default-pool-30f2c6c6-3jsv <none> <none>
- A failure of 3jsv will cause all mongo pods to go down.
- The mongo pods on nodes other than 3jsv will have to go over the network to access their data.
A similar issue exists (and is slightly more severe) with the jiva replica pods getting scheduled onto the same node:
pvc-1b21ac95-fd9f-466f-a39b-c1e1ab6e6cb5-jiva-rep-0 1/1 Running 0 54m 10.0.0.24 gke-kmova-helm-default-pool-30f2c6c6-3jsv <none> <none>
pvc-96120cb1-0f36-4a53-9263-6af8b8cc5a66-jiva-rep-0 1/1 Running 0 55m 10.0.0.19 gke-kmova-helm-default-pool-30f2c6c6-3jsv <none> <none>
pvc-faa218d5-46c6-4bb7-a598-024970cf9b4c-jiva-rep-0 1/1 Running 0 55m 10.0.2.17 gke-kmova-helm-default-pool-30f2c6c6-1942 <none> <none>
- Two of the replicas are on 3jsv, which means the data for two of the mongo pods exists only on 3jsv. A failure of 3jsv will cause that MongoDB data to be lost.
Describe the solution you'd like
Jiva Volume Policies should support an anti-affinity rule that ensures the replica pods serving a given application's volumes are not co-located on the same node.
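For illustration, a minimal sketch of the kind of pod anti-affinity the operator could inject into each replica pod spec, assuming all replica pods serving the same application carry a shared label (the label value `mongo` and the injection point are assumptions here, not the actual implementation):

```yaml
# Sketch only: standard Kubernetes pod anti-affinity that keeps replica pods
# of the same application on different nodes. Assumes every such replica pod
# is labeled openebs.io/replica-anti-affinity: mongo.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          openebs.io/replica-anti-affinity: mongo
      topologyKey: kubernetes.io/hostname
```

With the `requiredDuringSchedulingIgnoredDuringExecution` form the scheduler refuses to co-locate two such replicas; a `preferredDuringScheduling...` variant would only try to spread them.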
Anything else you would like to add:
This feature was supported with externally provisioned Jiva volumes - using the ReplicaAntiAffinityTopoKey policy and specifying a unique openebs.io/replica-anti-affinity label on all the PVCs belonging to the same application.
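For reference, the legacy (non-CSI) usage looked roughly like the sketch below; the exact keys are reproduced from memory, so verify them against the OpenEBS docs for your version:

```yaml
# Legacy (non-CSI) Jiva provisioner - illustrative sketch, not verified config.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-jiva-mongo
  annotations:
    cas.openebs.io/config: |
      - name: ReplicaAntiAffinityTopoKey
        value: kubernetes.io/hostname
provisioner: openebs.io/provisioner-iscsi
---
# Every PVC belonging to the same application carries the same unique label,
# so its replicas repel each other but not replicas of other applications.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-persistent-storage-mongo-0
  labels:
    openebs.io/replica-anti-affinity: mongo
spec:
  storageClassName: openebs-jiva-mongo
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
```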
Workaround
When using single-replica volumes, use local storage directly.
While migrating from older externally provisioned volumes to CSI volumes, we will need to schedule the STS pods on the same nodes as the old volumes. Adding the ability to specify node affinity rules for replicas in the policy will help with the migration.
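As a rough illustration of the requested knob, something along these lines in the volume policy would cover the migration case (the apiVersion and the field names under spec are assumptions, not the actual CRD schema):

```yaml
# Hypothetical policy snippet: pin replica pods onto the nodes that already
# hold the data of the externally provisioned volume. apiVersion and spec
# fields are illustrative assumptions, not the actual JivaVolumePolicy schema.
apiVersion: openebs.io/v1alpha1
kind: JivaVolumePolicy
metadata:
  name: mongo-migration-policy
  namespace: openebs
spec:
  replica:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - gke-kmova-helm-default-pool-30f2c6c6-3jsv
```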