emqx/emqx-operator

Replicant pod cannot join the cluster

memoliyasti opened this issue · 6 comments

We are running EMQX v5.0.16 in our on-prem Kubernetes clusters with 3 core nodes and 3 replicant nodes.
 
What we see from time to time is that 1 or 2 of the 3 EMQX replicants are unable to join the cluster.
In that situation, the affected pods are in the Running state, but EMQX inside them is not functional because it could not join the cluster.
 
As a result, the EMQX listeners Service still lists all three replicants as its endpoints.
When we open the EMQX dashboard, we see far fewer connections and only 4 nodes (3 cores and 1 replicant).
 
Thus, many of our clients' connections to the MQTT broker fail, because the EMQX listeners Service forwards a share of those requests to the faulty pods as well.
 
 
Our question:
When 1 or 2 EMQX replicants cannot join the cluster, as in our case, why doesn't the EMQX process fail? If it did, the container's "restartPolicy" would kick in and the container would be restarted.
After a restart, the replicant could possibly join the cluster and become functional.
 
We have observed that when we manually deleted both faulty pods, they were recreated and then joined the cluster successfully.
 
 
It appears that the faulty pods stay in that state indefinitely, until we delete them manually.
 
Is this functionality not built in? If not, could we request such functionality?
 
Or, if such functionality already exists, how can we use it? Is there a specific configuration that needs to be enabled for it to work?
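Until the operator handles this itself, one possible workaround is to make the kubelet restart a replicant that never joined: override the liveness probe in replicantTemplate so it checks cluster membership instead of bare process health. This is only a sketch; the probe-override fields and the exact `emqx ctl` check below are assumptions that should be verified against your operator version's CRD and the EMQX CLI, not a confirmed recipe.

```yaml
# Sketch only (unverified field names): override the replicant liveness
# probe so a pod whose EMQX node never joined the cluster is restarted
# by the kubelet instead of staying Running-but-broken indefinitely.
replicantTemplate:
  spec:
    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          # Assumption: `emqx ctl cluster status` prints the running
          # cluster nodes as seen by this node; tune the grep to your
          # node naming so the probe fails when the node is isolated.
          - emqx ctl cluster status | grep -q 'emqx@'
      initialDelaySeconds: 60
      periodSeconds: 30
      failureThreshold: 3
```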

Environment details:

  • Kubernetes version: v1.24.9
  • Cloud-provider/provisioner: on-prem cluster
  • emqx-operator version: 2.1.4
  • Install method: helm
Rory-Z commented

Hi @memoliyasti, sorry about this; it is a bug. For the Service, we only check whether the EMQX pod is ready; we don't check whether it has joined the cluster.

Could you please try EMQX Operator 2.2.2 and EMQX 5.1? I believe we fixed this in EMQX Operator 2.2.

Hi @Rory-Z

After switching to the versions you mentioned, I can see that the replicant pod I manually kicked out rejoined the cluster and started receiving new connections.
Thanks for your support.

I would also like to ask about upgrading EMQX.
We have an EMQX cluster running on Kubernetes and would like to upgrade it from v5.0.16 to v5.1.6. Can you share your thoughts on how we can do this with zero or minimal downtime?

When we upgraded from EMQX v4 to v5, we deployed the new cluster alongside the old one behind a separate ingress, then switched the ingress endpoint so that new connections landed on the new EMQX v5 cluster.

Is there any way to do this upgrade with the EMQX open-source edition without changing the ingress?

Rory-Z commented

We have an EMQX cluster running on Kubernetes and would like to upgrade it from v5.0.16 to v5.1.6. Can you share your thoughts on how we can do this with zero or minimal downtime?

Hi @memoliyasti, maybe you can check this: #932 (comment)
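For context, the operator's EMQX CR exposes an update strategy with an evacuation policy that is commonly used for low-downtime upgrades: new nodes come up first, then connections and sessions are gradually evicted from the old ones. The fragment below is an illustrative sketch for apps.emqx.io/v2beta1 (values are examples only; verify the field names against the operator documentation for your version). I am assuming, not confirming, that this is what the linked comment describes.

```yaml
# Illustrative sketch of a low-downtime upgrade via the operator's
# update strategy; rates and delays below are placeholder values.
spec:
  image: emqx:5.1.6           # target version
  updateStrategy:
    type: Recreate
    initialDelaySeconds: 10   # wait before starting node evacuation
    evacuationStrategy:
      waitTakeover: 10        # seconds to wait for session takeover
      connEvictRate: 1000     # MQTT connections evicted per second
      sessEvictRate: 1000     # sessions evicted per second
```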

Hi @Rory-Z, thanks for your support.

When I follow the procedure you described, the updated CR creates a new PVC even though I point it at the same StorageClass. That means I have to redo all configuration from scratch, such as creating the authentication database and the dashboard user/password. Is there any way to reuse the PVC that the old version uses? Below is the EMQX manifest I use to patch the old deployment.

apiVersion: apps.emqx.io/v2beta1
kind: EMQX
metadata:
  name: emqx
  namespace: <namespace>
spec:
  image: <image>
  imagePullSecrets:
    - name: <secret>
  config:
    mode: "Merge"
    data: |
      authorization {
        no_match = "deny"
        deny_action = "disconnect"
        sources = [{type: "file", enable: true, path: "/etc/emqx/acl.conf"}]
      }
      dashboard {
        default_username: "admin"
        default_password: "public"
      }
      mqtt {
        max_mqueue_len = "7200000"
        max_inflight = "10000"
        max_packet_size = "25MB"
      }
  coreTemplate:
    spec:
      replicas: 3
      volumeClaimTemplates:
        storageClassName: <StorageClassName>
        resources:
          requests:
            storage: 4Gi
        accessModes:
          - ReadWriteOnce
  replicantTemplate:
    spec:
      replicas: 3
      resources:
        requests:
          memory: 2Gi
          cpu: 500m
        limits:
          memory: 2Gi
Rory-Z commented

Sorry, you cannot reuse the same PVC across different StatefulSets. The good news is that you don't need to reset your configuration: once EMQX clusters successfully, it automatically synchronizes the configuration between nodes, provided they are running the same minor version.

PS: We recommend modifying the EMQX configuration through config.data, so that the configuration is easier to manage.
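To illustrate that recommendation with a minimal fragment (the setting shown is just an example): keep every setting you would otherwise change in the dashboard in config.data with mode Merge, so it is reapplied declaratively on each rollout rather than living only in a node's local state.

```yaml
# Example only: settings managed here survive pod and PVC recreation,
# because the CR, not a node's local state, is the source of truth.
spec:
  config:
    mode: Merge     # merge into the generated config instead of replacing it
    data: |
      mqtt {
        max_packet_size = "25MB"
      }
```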

Hi @Rory-Z, thanks for the help. I am closing the issue.