vernemq/docker-vernemq

CRASH REPORT with VerneMQ Helm Install

jensjohansen opened this issue · 5 comments

Crash Report:

15:23:11.431 [error] CRASH REPORT Process <0.844.0> with 0 neighbours crashed with reason: bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221
15:23:11.431 [error] Supervisor {<0.842.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.844.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:23:11.431 [error] Supervisor {<0.449.0>,ranch_acceptors_sup} had child {acceptor,<0.449.0>,1} started with ranch_acceptor:start_link({{192,168,40,176},8443}, 1, {sslsocket,nil,{#Port<0.15>,{config,#{middlebox_comp_mode => true,padding_check => true,signature_algs => ...,...},...}}}, ranch_ssl, logger) at <0.711.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated

Installation: VerneMQ Helm Chart version 1.8.0

cert-manager Certificate CRD (I verified that a valid Let's Encrypt cert is stored in mqtt-link-labs-tls-secret; a quick way to check this is sketched after the manifest):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mqtt-link-labs-tls-certificate
  namespace: mqtt-publishing
spec:
  secretName: mqtt-link-labs-tls-secret
  dnsNames:
  - "dev-mqtt.amz-link-labs.net"
  subject:
    organizations:
      - "Link Labs"
    organizationalUnits:
      - "Airfinder Asset RTLS"
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
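
To sanity-check what cert-manager actually stored in the secret, something like the following can help (a sketch; assumes kubectl, jq and openssl are available on the workstation):

# list the keys present in the secret (ca.crt, tls.crt, tls.key, ...)
kubectl -n mqtt-publishing get secret mqtt-link-labs-tls-secret -o jsonpath='{.data}' | jq 'keys'
# decode the certificate and confirm subject, issuer and validity dates
kubectl -n mqtt-publishing get secret mqtt-link-labs-tls-secret -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -subject -issuer -dates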

values.yaml

additionalEnv:
- name: DOCKER_VERNEMQ_ALLOW_ANONYMOUS
  value: "off"
- name: DOCKER_VERNEMQ_LISTENER__SSL__CAFILE
  value: /etc/ssl/vernemq/tls.crt
- name: DOCKER_VERNEMQ_LISTENER__SSL__CERTFILE
  value: /etc/ssl/vernemq/tls.crt
- name: DOCKER_VERNEMQ_LISTENER__SSL__KEYFILE
  value: /etc/ssl/vernemq/tls.key
- name: DOCKER_VERNEMQ_METADATA_PLUGIN
  value: vmq_plumtree
- name: DOCKER_VERNEMQ_PERSISTENT_CLIENT_EXPIRATION
  value: 1d
- name: DOCKER_VERNEMQ_ACCEPT_EULA
  value: "yes"
- name: DOCKER_VERNEMQ_PLUGINS__VMQ_PASSWD
  value: "off"
- name: DOCKER_VERNEMQ_PLUGINS__VMQ_ACL
  value: "off"
- name: DOCKER_VERNEMQ_PLUGINS__VMQ_DIVERSITY
  value: "on"
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__AUTH_MYSQL__ENABLED
  value: "on"
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__HOST
  value: <redacted>
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__PORT
  value: "3306"
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__USER
  value: vmq
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__PASSWORD
  value: <redacted>
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__DATABASE
  value: access
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__PASSWORD_HASH_METHOD
  value: sha256
- name: DOCKER_VERNEMQ_LOG__CONSOLE
  value: both
- name: DOCKER_VERNEMQ_LOG__CONSOLE__LEVEL
  value: debug
- name: DOCKER_VERNEMQ_TOPIC_MAX_DEPTH
  value: "20"
envFrom: []
extraVolumeMounts: []
extraVolumes: []
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: vernemq/vernemq
  tag: latest
ingress:
  annotations: {}
  className: ""
  enabled: false
  hosts: []
  labels: {}
  paths:
  - path: /
    pathType: ImplementationSpecific
  tls: []
nameOverride: ""
nodeSelector: {}
pdb:
  enabled: false
  minAvailable: 1
persistentVolume:
  accessModes:
  - ReadWriteOnce
  annotations: {}
  enabled: false
  size: 50Gi
podAntiAffinity: soft
rbac:
  create: true
  serviceAccount:
    create: true
replicaCount: 3
resources: {}
secretMounts:
- name: vernemq-certificate
  path: /etc/ssl/vernemq
  secretName: mqtt-link-labs-tls-secret
securityContext:
  fsGroup: 10000
  runAsGroup: 10000
  runAsUser: 10000
service:
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  api:
    enabled: true
    nodePort: 38888
    port: 8888
  enabled: true
  labels: {}
  loadBalancerSourceRanges:
  - <redacted>
  mqtt:
    enabled: true
    nodePort: 1883
    port: 1883
  mqtts:
    enabled: true
    nodePort: 8883
    port: 8883
  type: LoadBalancer
  ws:
    enabled: true
    nodePort: 8080
    port: 8080
  wss:
    enabled: true
    nodePort: 8443
    port: 8443
serviceMonitor:
  create: true
  labels: {}
statefulset:
  annotations: {}
  labels: {}
  lifecycle: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 90
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
  podAnnotations: {}
  podLabels: {}
  podManagementPolicy: OrderedReady
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 90
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
  terminationGracePeriodSeconds: 60
  updateStrategy: RollingUpdate
tolerations: []

Expected Behavior:
MQTT Explorer can connect to, and browse topics in

  • mqtt://dev-mqtt.amz-link-labs.net:1883
  • mqtts://dev-mqtt.amz-link-labs.net:8883
  • ws://dev-mqtt.amz-link-labs.net:8080
  • wss://dev-mqtt.amz-link-labs.net:8443

API available on https://dev-mqtt.amz-link-labs.net:8888

Actual Behavior

  • All three replicas endlessly repeat the crash report above.
  • Connections from MQTT Explorer time out.
  • Connections to the API time out.

@jensjohansen You're using the same file as certfile and cafile. I'm not 100% sure this is impossible, but you might want to check whether that's an issue.


👉 Thank you for supporting VerneMQ: https://github.com/sponsors/vernemq
👉 Using the binary VerneMQ packages commercially (.deb/.rpm/Docker) requires a paid subscription.

I will work on that. The Helm install instructions say that cert-manager inserts the CA cert into the TLS cert, and I see that, but they also say the secret is supposed to have three keys internally (ca.crt, tls.crt and tls.key), whereas the current version of cert-manager seems to create a secret with all three items in one key.

Still an issue. Even after removing these, the problem persists:

- name: DOCKER_VERNEMQ_LISTENER__SSL__CAFILE
  value: /etc/ssl/vernemq/tls.crt
- name: DOCKER_VERNEMQ_LISTENER__SSL__CERTFILE
  value: /etc/ssl/vernemq/tls.crt
- name: DOCKER_VERNEMQ_LISTENER__SSL__KEYFILE
  value: /etc/ssl/vernemq/tls.key

In the Helm instructions, the documentation says that cert-manager inserts the ca.crt into the secret along with the tls.crt and tls.key, and shows the expected format of the secret. However, when you create the certificate using a Certificate CRD backed by Let's Encrypt, the resulting secret has only two keys: tls.crt and tls.key. There are actually three certs in tls.crt, the authorization chain up to letsencrypt-production. I have tried extracting the ca.crt out into a separate key, following the pattern suggested for using existing keys, but this still gets exactly the same results. Roughly what I tried is sketched below; the resulting log excerpt follows it:
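
The extraction attempt, as a sketch (assumes kubectl and openssl on the workstation, and that the secret lives in the mqtt-publishing namespace):

# pull the chain and key that cert-manager stored in the secret
kubectl -n mqtt-publishing get secret mqtt-link-labs-tls-secret -o jsonpath='{.data.tls\.crt}' | base64 -d > chain.pem
kubectl -n mqtt-publishing get secret mqtt-link-labs-tls-secret -o jsonpath='{.data.tls\.key}' | base64 -d > tls.key
# everything after the first (leaf) certificate in the chain is the CA chain
awk 'split_after==1{n++;split_after=0} /-----END CERTIFICATE-----/{split_after=1} n>0' chain.pem > ca.crt
# rebuild the secret with an explicit ca.crt key
kubectl -n mqtt-publishing create secret generic mqtt-link-labs-tls-secret \
  --from-file=tls.crt=chain.pem --from-file=tls.key=tls.key --from-file=ca.crt=ca.crt \
  --dry-run=client -o yaml | kubectl apply -f -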

15:01:53.704 [error] Supervisor {<0.2120.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.2122.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:01:53.705 [error] Supervisor {<0.401.0>,ranch_acceptors_sup} had child {acceptor,<0.401.0>,2} started with ranch_acceptor:start_link({{192,168,48,156},8883}, 2, {sslsocket,nil,{#Port<0.13>,{config,#{versions => [{3,3}],keyfile => <<>>,next_protocol_selector => ...,...},...}}}, ranch_ssl, logger) at <0.2014.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:01:54.005 [debug] session normally stopped
15:01:58.198 [debug] session normally stopped
15:01:59.768 [error] CRASH REPORT Process <0.2135.0> with 0 neighbours crashed with reason: bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221
15:01:59.768 [error] Supervisor {<0.2133.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.2135.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:01:59.768 [error] Supervisor {<0.401.0>,ranch_acceptors_sup} had child {acceptor,<0.401.0>,3} started with ranch_acceptor:start_link({{192,168,48,156},8883}, 3, {sslsocket,nil,{#Port<0.13>,{config,#{versions => [{3,3}],keyfile => <<>>,next_protocol_selector => ...,...},...}}}, ranch_ssl, logger) at <0.2027.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:01.740 [error] CRASH REPORT Process <0.2140.0> with 0 neighbours crashed with reason: bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221
15:02:01.741 [error] Supervisor {<0.2138.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.2140.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:01.741 [error] Supervisor {<0.401.0>,ranch_acceptors_sup} had child {acceptor,<0.401.0>,4} started with ranch_acceptor:start_link({{192,168,48,156},8883}, 4, {sslsocket,nil,{#Port<0.13>,{config,#{versions => [{3,3}],keyfile => <<>>,next_protocol_selector => ...,...},...}}}, ranch_ssl, logger) at <0.2038.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:02.324 [debug] started plumtree_metadata_manager exchange with 'VerneMQ@vernemq-1.vernemq-headless.mqtt-publishing.svc.cluster.local' (<0.2143.0>)
15:02:02.327 [debug] completed metadata exchange with 'VerneMQ@vernemq-1.vernemq-headless.mqtt-publishing.svc.cluster.local'. nothing repaired
15:02:02.329 [debug] 0ms mailbox traversal, schedule next lazy broadcast in 10000ms, the min interval is 10000ms
15:02:02.478 [debug] session normally stopped
15:02:05.596 [error] CRASH REPORT Process <0.2154.0> with 0 neighbours crashed with reason: bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221
15:02:05.596 [error] Supervisor {<0.2152.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.2154.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:05.597 [error] Supervisor {<0.401.0>,ranch_acceptors_sup} had child {acceptor,<0.401.0>,5} started with ranch_acceptor:start_link({{192,168,48,156},8883}, 5, {sslsocket,nil,{#Port<0.13>,{config,#{versions => [{3,3}],keyfile => <<>>,next_protocol_selector => ...,...},...}}}, ranch_ssl, logger) at <0.2051.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:07.093 [debug] session normally stopped
15:02:08.320 [debug] session normally stopped
15:02:09.830 [error] CRASH REPORT Process <0.2166.0> with 0 neighbours crashed with reason: bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221
15:02:09.830 [error] Supervisor {<0.2164.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.2166.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:09.830 [error] Supervisor {<0.401.0>,ranch_acceptors_sup} had child {acceptor,<0.401.0>,6} started with ranch_acceptor:start_link({{192,168,48,156},8883}, 6, {sslsocket,nil,{#Port<0.13>,{config,#{versions => [{3,3}],keyfile => <<>>,next_protocol_selector => ...,...},...}}}, ranch_ssl, logger) at <0.2056.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:11.923 [error] CRASH REPORT Process <0.2171.0> with 0 neighbours crashed with reason: bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221
15:02:11.923 [error] Supervisor {<0.2169.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.2171.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:11.924 [error] Supervisor {<0.401.0>,ranch_acceptors_sup} had child {acceptor,<0.401.0>,7} started with ranch_acceptor:start_link({{192,168,48,156},8883}, 7, {sslsocket,nil,{#Port<0.13>,{config,#{versions => [{3,3}],keyfile => <<>>,next_protocol_selector => ...,...},...}}}, ranch_ssl, logger) at <0.2069.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:12.325 [debug] started plumtree_metadata_manager exchange with 'VerneMQ@vernemq-0.vernemq-headless.mqtt-publishing.svc.cluster.local' (<0.2173.0>)
15:02:12.326 [debug] completed metadata exchange with 'VerneMQ@vernemq-0.vernemq-headless.mqtt-publishing.svc.cluster.local'. nothing repaired
15:02:12.330 [debug] 0ms mailbox traversal, schedule next lazy broadcast in 10000ms, the min interval is 10000ms
15:02:13.293 [debug] session normally stopped
15:02:16.874 [error] CRASH REPORT Process <0.2184.0> with 0 neighbours crashed with reason: bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221
15:02:16.874 [error] Supervisor {<0.2182.0>,tls_dyn_connection_sup} had child receiver started with {ssl_gen_statem,start_link,undefined} at <0.2184.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:16.875 [error] Supervisor {<0.401.0>,ranch_acceptors_sup} had child {acceptor,<0.401.0>,8} started with ranch_acceptor:start_link({{192,168,48,156},8883}, 8, {sslsocket,nil,{#Port<0.13>,{config,#{versions => [{3,3}],keyfile => <<>>,next_protocol_selector => ...,...},...}}}, ranch_ssl, logger) at <0.2081.0> exit with reason bad argument in call to erlang:binary_to_list(undefined) in ssl_config:file_error/2 line 221 in context child_terminated
15:02:18.161 [debug] session normally stopped
15:02:18.443 [debug] session normally stopped

Even a hint as to what the line 221 error means would help. I am guessing I will have to dig into the code next to make progress.

Solution:

Contrary to the documentation, cert-manager doesn't add the CA cert to the certificate file.

To stop the crash reports with cert-manager 1.9.1 and later:

  1. Create a Let's Encrypt account based on an email used by your DevOps team at https://community.letsencrypt.org/
  2. Create a ClusterIssuer to use this account
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    email: <your devops email here>
    preferredChain: ''
    privateKeySecretRef:
      name: letsencrypt-production
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - http01:
          ingress:
            class: nginx-external   # change this to your internet-facing ingress controller's ingressClassName

Use the DevOps email you registered as your Let's Encrypt account and the ingressClassName of your internet-facing ingress controller. An apply/verify check is sketched below.
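
Assuming the ClusterIssuer manifest above is saved as cluster-issuer.yaml, applying it and checking that it becomes Ready might look like this (a sketch):

kubectl apply -f cluster-issuer.yaml
# READY should report True once the ACME account registration succeeds
kubectl get clusterissuer letsencrypt-production -o wide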

  3. Create a certificate secret for your Helm chart to use (a quick verification sketch follows the manifest):
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mqtt-tls-certificate
  namespace: mqtt-publishing   # In case you are using a namespace for vernemq other than default
spec:
  dnsNames:
    - dev-mqtt.amz-link-labs.net
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: letsencrypt-production
  secretName: mqtt-tls-secret   # this is the secret where the tls.crt and tls.key will be stored
  usages:
    - digital signature
    - key encipherment
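
Assuming the Certificate manifest above is saved as certificate.yaml, a quick way to apply it and confirm that the secret gets created (a sketch; jq is optional):

kubectl apply -f certificate.yaml
# wait for cert-manager to solve the HTTP-01 challenge and issue the certificate
kubectl -n mqtt-publishing wait certificate mqtt-tls-certificate --for=condition=Ready --timeout=180s
# the secret should now contain tls.crt and tls.key
kubectl -n mqtt-publishing get secret mqtt-tls-secret -o jsonpath='{.data}' | jq 'keys'
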
  4. Install VerneMQ (I used Helm Chart version 1.8.0) using something like the values below (values.yaml); the helm command I used is sketched after the values.
additionalEnv:
# Note: specifying the CAFILE causes the crash report above, so leave it out for Let's Encrypt certs
# - name: DOCKER_VERNEMQ_LISTENER__SSL__CAFILE
#   value: /etc/ssl/vernemq/tls.crt
- name: DOCKER_VERNEMQ_LISTENER__SSL__CERTFILE   
  value: /etc/ssl/vernemq/tls.crt
- name: DOCKER_VERNEMQ_LISTENER__SSL__KEYFILE
  value: /etc/ssl/vernemq/tls.key
- name: DOCKER_VERNEMQ_ALLOW_REGISTER_DURING_NETSPLIT
  value: "on"
- name: DOCKER_VERNEMQ_ALLOW_PUBLISH_DURING_NETSPLIT
  value: "on"
- name: DOCKER_VERNEMQ_ALLOW_SUBSCRIBE_DURING_NETSPLIT
  value: "on"
- name: DOCKER_VERNEMQ_ALLOW_UNSUBSCRIBE_DURING_NETSPLIT
  value: "on"
- name: DOCKER_VERNEMQ_ALLOW_ANONYMOUS
  value: "off"
- name: DOCKER_VERNEMQ_METADATA_PLUGIN
  value: vmq_plumtree
- name: DOCKER_VERNEMQ_PERSISTENT_CLIENT_EXPIRATION
  value: 1d
- name: DOCKER_VERNEMQ_ACCEPT_EULA
  value: "yes"
- name: DOCKER_VERNEMQ_PLUGINS__VMQ_PASSWD
  value: "off"
- name: DOCKER_VERNEMQ_PLUGINS__VMQ_ACL
  value: "off"
- name: DOCKER_VERNEMQ_PLUGINS__VMQ_DIVERSITY
  value: "on"
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__AUTH_MYSQL__ENABLED
  value: "on"
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__HOST
  value: <redacted>
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__PORT
  value: "3306"
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__USER
  value: vmq
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__PASSWORD
  value: <redacted>
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__DATABASE
  value: access
- name: DOCKER_VERNEMQ_VMQ_DIVERSITY__MYSQL__PASSWORD_HASH_METHOD
  value: sha256
- name: DOCKER_VERNEMQ_LOG__CONSOLE
  value: both
- name: DOCKER_VERNEMQ_LOG__CONSOLE__LEVEL
  value: debug
- name: DOCKER_VERNEMQ_TOPIC_MAX_DEPTH
  value: "20"
envFrom: []
extraVolumeMounts: []
extraVolumes: []
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: vernemq/vernemq
  tag: 1.12.6.1-alpine
ingress:
  annotations:
    app.kubernetes.io/name: vernemq
    # acme.cert-manager.io annotations configure cert-manager's ACME solver, which handles the Let's Encrypt negotiation;
    # tell it which ingress controller is internet-facing so cert-manager can automate the Let's Encrypt challenges
    acme.cert-manager.io/http01-ingress-class: nginx-external
    acme.cert-manager.io/http01-edit-in-place: "true" # reuse existing valid certs
    alb.ingress.kubernetes.io/scheme: internet-facing
    cert-manager.io/cluster-issuer: letsencrypt-production
    certmanager.k8s.io/acme-challenge-type: http01    # use HTTP rather than DNS challenges
  className: nginx-external    # make MQTT available on an internet-facing ingress controller
  enabled: true
  hosts:
  - dev-mqtt.amz-link-labs.net
  labels: {}
  paths:
  - path: /
    pathType: ImplementationSpecific
  tls:
  - hosts:
    - dev-mqtt.amz-link-labs.net
    secretName: mqtt-tls-secret
nameOverride: ""
nodeSelector: {}
pdb:
  enabled: false
  minAvailable: 1
persistentVolume:
  accessModes:
  - ReadWriteOnce
  annotations: {}
  enabled: true
  size: 50Gi
podAntiAffinity: soft
rbac:
  create: true
  serviceAccount:
    create: true
replicaCount: 3
resources: {}
secretMounts:
- name: vernemq-certificates
  path: /etc/ssl/vernemq
  secretName: mqtt-tls-secret    # the secret created in step 3 above
securityContext:
  fsGroup: 10000
  runAsGroup: 10000
  runAsUser: 10000
service:
  annotations: {}
  api:
    enabled: true
    nodePort: 38888
    port: 8888
  enabled: true
  labels: {}
  mqtt:
    enabled: true
    nodePort: 1883
    port: 1883
  mqtts:
    enabled: true
    nodePort: 8883
    port: 8883
  type: ClusterIP
  ws:
    enabled: true
    nodePort: 8080
    port: 8080
  wss:
    enabled: true
    nodePort: 8443
    port: 8443
serviceMonitor:
  create: true
  labels: {}
statefulset:
  annotations: {}
  labels: {}
  lifecycle: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 90
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
  podAnnotations: {}
  podLabels: {}
  podManagementPolicy: OrderedReady
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 90
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
  terminationGracePeriodSeconds: 60
  updateStrategy: RollingUpdate
tolerations: []
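
For completeness, the install itself was something like the following (a sketch; the repo URL is the one from the chart's README, and the release name and namespace are just the ones I used):

helm repo add vernemq https://vernemq.github.io/docker-vernemq
helm repo update
# chart version 1.8.0, with the values.yaml shown in step 4
helm install vernemq vernemq/vernemq --version 1.8.0 -n mqtt-publishing -f values.yaml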

@jensjohansen Thank you for documenting your solution! So, is there anything still wrong or incomplete in our documentation?

