karimra/gnmic

Add kubernetes type clustering option

melkypie opened this issue · 12 comments

Currently the only KV storage we can use for clustering is Consul. A nice feature would be to add a Kubernetes type and store all of the key/value information in Kubernetes objects, similar to how argocd does it. This would allow the user to not have to maintain another KV storage solution.
I know this is quite a big ask, but I have already managed to deploy gnmic clustering on Kubernetes with Consul, and having this would let me stop worrying about running another KV store. If needed, I could write a guide on how to deploy it to Kubernetes and help with the ServiceAccount/RoleBinding/Role objects and other Kubernetes related things.

I'm not sure if I understand correctly but this sounds like it needs a separate component acting as a k8s controller for gnmic. It would be responsible for managing the state of the cluster.
It could be that I'm overthinking this.
How does argocd store instance state in k8s? Do the objects have a TTL? Can the TTL be refreshed?

A guide to deploy gnmic on k8s will be very helpful, it would fit nicely with the docs.

Argocd stores most of its configuration in Secrets (though I am sure ConfigMaps would also be fine for gnmic) and Custom Resource Definitions (which would be overkill for the simple use case in gnmic), both of which are basically key-value stores. They don't have a built-in way to set a TTL, but you could store a TTL value as an entry in the ConfigMap if that is needed.
From what I can currently see, all that is stored in Consul is the leader of the cluster and which instance each target belongs to, which is something ConfigMaps in k8s can easily hold. Consul's service availability checking feature also exists in k8s.

For the guide, I will start working on it right away.

Thanks for working on the guide and thanks for the details about argocd.

Consul does a little bit more than just storage.
What I meant by TTL is a way for a key (leader or target ownership) to be deleted after a certain duration if its owner does not refresh it. Consul handles this natively. The key TTL mechanism makes leader election/reelection as well as target ownership locking/transfer easy.
Consul also allows running a long (blocking) request to get notifications about service changes, basically removing the need for periodic polls to discover instances of a certain service.

About using k8s as KV store for clustering, I think ownerReference can be used for leader election and target ownership:

  • At startup, each gNMIc instance/pod tries to create a ConfigMap with a well-known predefined name; the first one to create it becomes the leader. The ones that failed to become leader periodically check whether the ConfigMap still exists and try to create it if it doesn't; the one that succeeds takes over as the new leader.
  • Then, same as with Consul-based clustering, the leader proceeds to dispatch targets to the available gNMIc instances.
  • When assigned a target, each instance creates a ConfigMap indicating that it claims ownership over that target and proceeds with creating the gNMI subscriptions.
  • Each created ConfigMap will have its ownerReference field populated with a reference to the gNMIc instance that created it. If a ConfigMap doesn't have an ownerReference it is deleted by k8s GC.
  • The leader periodically goes over the list of ConfigMaps to make sure that each target has a corresponding ConfigMap with an existing owner. If a ConfigMap for a certain target is missing, the leader reassigns that target to an available gNMIc instance.
  • A liveness probe might be needed to detect a failed gNMIc pod and delete it.

I believe this should work, open to comments and suggestions, I might have missed something or expected a piece to work differently from its real behavior.
I will give this a try and get back to you.
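The create-or-take-over loop from the steps above can be sketched in plain Go, with an in-memory map standing in for the Kubernetes API's create-if-absent behavior (the object name and instance names are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// store simulates the create-if-absent semantics of the Kubernetes API:
// creating an object that already exists fails, just like a ConfigMap
// with a well-known name can only be created by one pod.
type store struct {
	mu      sync.Mutex
	objects map[string]string // name -> owning instance
}

// create returns true only for the first caller for a given name.
func (s *store) create(name, owner string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.objects[name]; exists {
		return false
	}
	s.objects[name] = owner
	return true
}

func (s *store) delete(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.objects, name)
}

const leaderKey = "gnmic-leader" // hypothetical well-known ConfigMap name

func main() {
	s := &store{objects: make(map[string]string)}

	// At startup, every instance races to create the leader object;
	// only the first create succeeds.
	for _, instance := range []string{"gnmic-ss-0", "gnmic-ss-1", "gnmic-ss-2"} {
		if s.create(leaderKey, instance) {
			fmt.Println("leader:", instance)
		}
	}

	// When the leader object disappears (owner died, object GC'd via
	// ownerReference), the next instance to re-create it takes over.
	s.delete(leaderKey)
	if s.create(leaderKey, "gnmic-ss-1") {
		fmt.Println("new leader: gnmic-ss-1")
	}
}
```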

@melkypie if you can give the 0.25.0-beta release a try, you will be able to test k8s based clustering.
It uses leases as a locking mechanism.

The deployment method is similar to what you already did with Consul except:

  • Obviously no need to deploy a Consul cluster
  • The clustering part in the configMap becomes:
    clustering:
      cluster-name: cluster1
      targets-watch-timer: 30s
      leader-wait-timer: 30s
      locker:
        type: k8s
        namespace: gnmic # defaults to "default"
  • RBAC:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: gnmic 
  name: svc-pod-lease-reader
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gnmic-user
  namespace: gnmic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-leases
  namespace: gnmic 
subjects:
- kind: ServiceAccount
  name: gnmic-user 
roleRef:
  kind: Role 
  name: svc-pod-lease-reader 
  apiGroup: rbac.authorization.k8s.io
  • Add the created service account to the SS spec:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gnmic-ss
  labels:
    app: gnmic
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gnmic
  serviceName: gnmic-svc
  template:
    metadata:
      labels:
        app: gnmic
    spec:
      containers:
        - args:
            - subscribe
            - --config
            - /app/config.yaml
          image: gnmic:0.0.0-k 
          imagePullPolicy: IfNotPresent
          name: gnmic
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          ports:
            - containerPort: 9804
              name: prom-output
              protocol: TCP
            - containerPort: 7890
              name: gnmic-api
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 400Mi
            requests:
              cpu: 50m
              memory: 200Mi
          envFrom:
            - secretRef:
                name: gnmic-login
          env:
            - name: GNMIC_API
              value: :7890
            - name: GNMIC_CLUSTERING_INSTANCE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: GNMIC_CLUSTERING_SERVICE_ADDRESS
              value: "$(GNMIC_CLUSTERING_INSTANCE_NAME).gnmic-svc.gnmic.svc.cluster.local"
            - name: GNMIC_OUTPUTS_OUTPUT1_LISTEN
              value: "$(GNMIC_CLUSTERING_INSTANCE_NAME).gnmic-svc.gnmic.svc.cluster.local:9804"
          volumeMounts:
            - mountPath: /app/config.yaml
              name: config
              subPath: config.yaml
      serviceAccountName: gnmic-user # <-- service account name created earlier
      volumes:
        - configMap:
            defaultMode: 420
            name: gnmic-config
          name: config
  • Add a service for the gNMIc instance API; this service HAS to be named ${cluster-name}-gnmic-api
apiVersion: v1
kind: Service
metadata:
  name: cluster1-gnmic-api
  labels:
    app: gnmic
spec:
  ports:
  - name: http
    port: 7890
    protocol: TCP
    targetPort: 7890
  selector:
    app: gnmic
  clusterIP: None

I did some tests on my side; it seems to be stable even when shrinking the SS size.

karim@kss:~/github.com/karimra/gnmic$ kubectl get leases
NAME                                  HOLDER       AGE
gnmic-cluster1-leader                 gnmic-ss-0   2d4h
gnmic-cluster1-targets-172.20.20.15   gnmic-ss-0   2d4h
gnmic-cluster1-targets-172.20.20.16   gnmic-ss-1   2d4h
gnmic-cluster1-targets-172.20.20.17   gnmic-ss-0   2d4h
gnmic-cluster1-targets-172.20.20.18   gnmic-ss-2   2d4h
gnmic-cluster1-targets-172.20.20.19   gnmic-ss-0   2d4h
gnmic-cluster1-targets-172.20.20.20   gnmic-ss-2   2d4h
gnmic-cluster1-targets-172.20.20.21   gnmic-ss-2   2d4h
gnmic-cluster1-targets-172.20.20.22   gnmic-ss-2   2d4h
gnmic-cluster1-targets-172.20.20.23   gnmic-ss-1   2d4h
gnmic-cluster1-targets-172.20.20.24   gnmic-ss-1   2d4h
gnmic-cluster1-targets-172.20.20.25   gnmic-ss-0   2d4h
gnmic-cluster1-targets-172.20.20.26   gnmic-ss-1   2d4h
gnmic-cluster1-targets-172.20.20.27   gnmic-ss-0   2d4h
gnmic-cluster1-targets-172.20.20.28   gnmic-ss-2   2d4h
gnmic-cluster1-targets-172.20.20.29   gnmic-ss-1   2d4h
karim@kss:~/github.com/karimra/gnmic$ 

There is no mechanism to redistribute the targets when growing the SS.

It would be helpful if you could give it a go to see if it fits your needs.

Will do. I won't be able to get back to you until Tuesday, as I don't have access to a cluster where I could test out gNMI due to the Easter holidays.

I gave it a try.
From my experience, it was only able to assign a target to the leader of the cluster; the non-leader instances seem to fail to acquire locks for targets assigned to them.
So it manages to assign one target (the target the leader assigns to itself after failing to assign it to other instances) and then keeps failing to assign other targets because their locks are not acquired, although if you look at the leases you can see that the lease has been created.
I am testing this on an RKE2 cluster with 3 masters and 2 workers, kubernetes version: v1.22.5+rke2r1

melkypie:~/projects/kubernetes$ kubectl get leases -n gnmic
NAME                                    HOLDER       AGE
gnmic-ip-net-monit1-leader              gnmic-ss-2   30m
gnmic-ip-net-monit1-targets-device1     gnmic-ss-2   29m
gnmic-ip-net-monit1-targets-device2     gnmic-ss-0   11s

StatefulSet.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gnmic-ss
  namespace: gnmic
  labels:
    app: gnmic
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gnmic
  serviceName: gnmic-svc
  template:
    metadata:
      labels:
        app: gnmic
        version: 0.25.0-beta
    spec:
      containers:
        - args:
            - subscribe
            - --config
            - /app/config.yaml
          image: ghcr.io/karimra/gnmic:0.25.0-beta-scratch
          imagePullPolicy: IfNotPresent
          name: gnmic
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          ports:
            - containerPort: 9804
              name: prom-output
              protocol: TCP
            - containerPort: 7890
              name: gnmic-api
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 400Mi
            requests:
              cpu: 50m
              memory: 200Mi
          envFrom:
            - secretRef:
                name: gnmic-login
          env:
            - name: GNMIC_API
              value: :7890
            - name: GNMIC_CLUSTERING_INSTANCE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: GNMIC_CLUSTERING_SERVICE_ADDRESS
              value: "$(GNMIC_CLUSTERING_INSTANCE_NAME).gnmic-svc.gnmic.svc.cluster.local"
            - name: GNMIC_OUTPUTS_PROM_LISTEN
              value: "$(GNMIC_CLUSTERING_INSTANCE_NAME).gnmic-svc.gnmic.svc.cluster.local:9804"
          volumeMounts:
            - mountPath: /app/config.yaml
              name: config
              subPath: config.yaml
      serviceAccountName: gnmic-user
      volumes:
        - configMap:
            defaultMode: 420
            name: gnmic-config
          name: config

RBAC.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: gnmic
  name: svc-pod-lease-reader
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gnmic-user
  namespace: gnmic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-leases
  namespace: gnmic
subjects:
- kind: ServiceAccount
  name: gnmic-user
roleRef:
  kind: Role
  name: svc-pod-lease-reader
  apiGroup: rbac.authorization.k8s.io

Service.yaml

apiVersion: v1
kind: Service
metadata:
  name: gnmic-svc
  namespace: gnmic
  labels:
    app: gnmic
spec:
  ports:
  - name: http
    port: 9804
    protocol: TCP
    targetPort: 9804
  selector:
    app: gnmic
  clusterIP: None
---
apiVersion: v1
kind: Service
metadata:
  name: cluster1-gnmic-api
  namespace: gnmic
spec:
  ports:
  - name: http
    port: 7890
    protocol: TCP
    targetPort: 7890
  selector:
    app: gnmic
  clusterIP: None

ConfigMap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: gnmic-config
  namespace: gnmic
data:
  config.yaml: |
    insecure: true
    encoding: json_ietf
    log: true

    clustering:
      cluster-name: cluster1
      targets-watch-timer: 30s
      leader-wait-timer: 30s
      locker:
        type: k8s
        namespace: gnmic

    targets:
      device1:
        address: device1:6030
        subscriptions:
          - general
      device2:
        address: device2:6030
        subscriptions:
          - general
      device3:
        address: device3:6030
        subscriptions:
          - general
      device4:
        address: device4:6030
        subscriptions:
          - general

    subscriptions:
      general:
        paths:
          - /interfaces/interface/state/counters
        stream-mode: sample
        sample-interval: 5s

    outputs:
      prom:
        type: prometheus
        strings-as-labels: true

Also adding sanitized log files (I also noticed that gnmic seems to be logging plaintext passwords, which it would be great if it did not do):
gnmic-ss-1.log
gnmic-ss-0.log
gnmic-ss-2.log

The logs are from trying it a second time, so you can't see where it created the device1 lease.

I'm not sure what is going wrong here; I re-tested with a single node as well as with 1 control plane and 2 worker nodes (1.23.4 and 1.22.7).
I'm using kind clusters.
The leader timing out and reassigning the target to another node means that the selected instance was not able to create the lease and/or maintain it.

The leader assigning the target to itself I understood, but yeah, the most interesting part is that the lock/lease is not recognized by the leader, even though if you look at the leases it is there.
My other thought was that maybe something was wrong with RBAC, but when I start a pod with kubectl using that same ServiceAccount (the one gnmic uses), it can access all of the leases, so I'm not sure what is going on there.
I will give it another try tomorrow and try deleting the whole namespace before doing it.

Finally got around to testing it and I found the error!
I had a cluster name with a - in it. So when it tries to list the leases, it replaces the cluster name's - with / here:

prefix = strings.ReplaceAll(prefix, "/", "-")

It is my fault for not providing exact configs I used to deploy as then it might have been easier to debug.

EDIT: This also seems to be the case with targets having - in them.

That part actually replaces / with -.
But I think you put your finger on the problem; the leader won't be able to retrieve a lock if the cluster name or the target name contains a -. Thanks for sharing your findings.

The leader keeps a mapping of the transformed key (/ --> -) to the original key to be able to revert it back, but it can only map back the keys it locked itself (silly me); that's why only the leader's locks are successful.
I was hoping to get away with this to maintain compatibility with the consul locker and not have to rewrite the global clustering code.

I got rid of the key mapping and added the original key as an annotation to the lease, that's how the List function will be able to return the list of original keys given a prefix.
I did some tests with cluster name cluster-1 and it seems to be fine, a target lease looks like this:

Name:         gnmic-cluster-1-targets-172.20.20.2
Namespace:    gnmic
Labels:       app=gnmic
              gnmic-cluster-1-targets-172.20.20.2=gnmic-ss-2
Annotations:  original-key: gnmic/cluster-1/targets/172.20.20.2
API Version:  coordination.k8s.io/v1
Kind:         Lease
Metadata:
  Creation Timestamp:  2022-04-26T05:31:05Z
  Managed Fields:
    API Version:  coordination.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:original-key:
        f:labels:
          .:
          f:app:
          f:gnmic-cluster-1-targets-172.20.20.2:
      f:spec:
        f:acquireTime:
        f:holderIdentity:
        f:leaseDurationSeconds:
        f:renewTime:
    Manager:         gnmic
    Operation:       Update
    Time:            2022-04-26T05:31:05Z
  Resource Version:  1876693
  UID:               ea0e4259-b39a-47f2-a62a-60dfb64cccb1
Spec:
  Acquire Time:            2022-04-26T05:39:53.085031Z
  Holder Identity:         gnmic-ss-2
  Lease Duration Seconds:  10
  Renew Time:              2022-04-26T05:39:53.085031Z
Events:                    <none>
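The annotation approach can be sketched in plain Go, with an in-memory struct standing in for the Lease object (the original-key annotation name is taken from the dump above; everything else is illustrative, not gnmic's actual locker code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lease mimics the relevant bits of a coordination.k8s.io Lease:
// a sanitized object name plus annotations carrying the original key.
type lease struct {
	name        string
	annotations map[string]string
}

func newLease(originalKey string) lease {
	return lease{
		// '/' is not allowed in k8s object names, so it is replaced,
		// and the lossless original is kept as an annotation.
		name: strings.ReplaceAll(originalKey, "/", "-"),
		annotations: map[string]string{
			"original-key": originalKey,
		},
	}
}

// listOriginalKeys returns the original keys matching a prefix by
// reading the annotation, without reverse-mapping the sanitized name.
func listOriginalKeys(leases []lease, prefix string) []string {
	var keys []string
	for _, l := range leases {
		if k := l.annotations["original-key"]; strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	leases := []lease{
		newLease("gnmic/cluster-1/leader"),
		newLease("gnmic/cluster-1/targets/172.20.20.2"),
		newLease("gnmic/cluster-1/targets/device-2"),
	}
	for _, k := range listOriginalKeys(leases, "gnmic/cluster-1/targets/") {
		fmt.Println(k)
	}
}
```

Because the original key is stored verbatim, names containing - no longer need any ambiguous reverse substitution.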

I will issue a release shortly with this code so you can test it (if you don't mind).

Seems to be fine. It works with both cluster names and targets having - in them.

The targets not being redistributed when the StatefulSet is scaled up is, as you said, quite an important feature, but that is out of scope for this issue.

Thanks for testing it, I will write some docs about k8s based clustering before releasing.

Concerning redistribution, I think this can be done periodically (enabled via a knob, e.g. redistribution-interval: 5m)
or triggered by an API request to the leader.
If you are interested in this, please open another issue and we can follow it up there.
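As a rough illustration of what such a periodic pass could do (purely a sketch under the assumption of round-robin assignment, not gnmic's actual dispatch logic), a rebalance over the current set of instances might look like:

```go
package main

import "fmt"

// rebalance spreads targets evenly (round-robin) over the current
// instances. A (hypothetical) redistribution-interval ticker, or an
// API call to the leader, would trigger a pass like this after the
// StatefulSet grows.
func rebalance(targets, instances []string) map[string][]string {
	assignment := make(map[string][]string, len(instances))
	for i, t := range targets {
		inst := instances[i%len(instances)]
		assignment[inst] = append(assignment[inst], t)
	}
	return assignment
}

func main() {
	targets := []string{"device1", "device2", "device3", "device4"}

	// Before scale-up: 2 instances share 4 targets.
	fmt.Println(rebalance(targets, []string{"gnmic-ss-0", "gnmic-ss-1"}))

	// After the StatefulSet grows to 3 replicas, a rebalance pass
	// evens the load out again.
	fmt.Println(rebalance(targets, []string{"gnmic-ss-0", "gnmic-ss-1", "gnmic-ss-2"}))
}
```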