Emqx core node in error

Question

mariusstaicu opened this issue 2 years ago · 9 comments

Describe the bug
An EMQX core node fails to start on k8s after the EMQX resource is deleted and recreated with a persistent volume. The goal was to deploy a broker that preserves its settings (dashboard users, MQTT client auth & authz) even if the EMQX resource is deleted.

To Reproduce
1. Install emqx-operator via Helm in k8s.
2. Deploy a basic EMQX broker with volumeClaimTemplate.
3. Wait for it to start, log in via the dashboard and change the password.
4. Delete the EMQX resource and recreate it.

One of three core nodes doesn't start and is in error.

Expected behavior
All the pods start, both core and replicant nodes.

Anything else we need to know?:
emqx yaml file:

apiVersion: apps.emqx.io/v2alpha1
kind: EMQX
metadata:
  name: emqx
  namespace: emqx-operator-system
spec:
  image: emqx/emqx:5.0.8
  coreTemplate:
    metadata:
      name: emqx-core
      labels:
        apps.emqx.io/instance: emqx
        apps.emqx.io/db-role: core
    spec:
      replicas: 3
      volumeClaimTemplates:
        storageClassName: longhorn-single-replica
        resources:
          requests:
            storage: 512Mi
        accessModes:
          - ReadWriteOnce
      livenessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 60
        periodSeconds: 30
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 12
      podSecurityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        fsGroupChangePolicy: Always
      containerSecurityContext:
        runAsUser: 1000
        runAsGroup: 1000
      lifecycle:
        preStop:
          exec:
            command: [ "/bin/sh","-c","emqx ctl cluster leave" ]

Environment details:

Kubernetes version: v1.21.10
Cloud-provider/provisioner: hosted
emqx-operator version: 2.0.1
Install method: helm

Logs:

dashboard.bootstrap_users_file = EMQX_DASHBOARD__BOOTSTRAP_USERS_FILE = "/opt/emqx/data/bootstrap_user"                                                                                            
rpc.port_discovery = EMQX_RPC__PORT_DISCOVERY = manual                                                                                                                                             
log.file_handlers.default.enable = EMQX_LOG__FILE_HANDLERS__DEFAULT__ENABLE = false                                                                                                                
log.console_handler.enable = EMQX_LOG__CONSOLE_HANDLER__ENABLE = true                                                                                                                              
node.db_role = EMQX_NODE__DB_ROLE = core                                                                                                                                                           
node.name = EMQX_NODE__NAME = emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local                                                                                                
2022-10-18T11:03:36.662853+00:00 [error] Mnesia('emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned
_network, 'emqx@emqx-core-1.emqx-headless.emqx-operator-system.svc.cluster.local'}                                                                                                                 
2022-10-18T11:03:36.663234+00:00 [error] Mnesia('emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned
_network, 'emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local'}                                                                                                                 
2022-10-18T11:03:36.854583+00:00 [error] Mnesia('emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'): ** ERROR ** (core dumped to file: "/opt/emqx/MnesiaCore.emqx@emqx-core-0.
emqx-headless.emqx-operator-system.svc.cluster.local_1666_91016_807092"), ** FATAL ** Failed to merge schema: Incompatible schema cookies. Please, restart from old backup.'emqx@emqx-core-1.emqx-h
eadless.emqx-operator-system.svc.cluster.local' = [{name,schema},{type,set},{ram_copies,[]},{disc_copies,['emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local','emqx@emqx-core-1
.emqx-headless.emqx-operator-system.svc.cluster.local']},{disc_only_copies,[]},{load_order,0},{access_mode,read_write},{majority,false},{index,[]},{snmp,[]},{local_content,false},{record_name,sch
ema},{attributes,[table,cstruct]},{user_properties,[{mnesia_backend_types,[{rocksdb_copies,mnesia_rocksdb},{null_copies,mria_mnesia_null_storage}]}]},{frag_properties,[]},{storage_properties,[]},
{cookie,{{1666087950880376159,-576460752303423101,1},'emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local'}},{version,{{7,0},{'emqx@emqx-core-1.emqx-headless.emqx-operator-syste
m.svc.cluster.local',{1666,88190,221401}}}}], 'emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local' = [{name,schema},{type,set},{ram_copies,[]},{disc_copies,['emqx@emqx-core-2.e
mqx-headless.emqx-operator-system.svc.cluster.local','emqx@emqx-core-1.emqx-headless.emqx-operator-system.svc.cluster.local','emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local
']},{disc_only_copies,[]},{load_order,0},{access_mode,read_write},{majority,false},{index,[]},{snmp,[]},{local_content,false},{record_name,schema},{attributes,[table,cstruct]},{user_properties,[{
mnesia_backend_types,[{rocksdb_copies,mnesia_rocksdb},{null_copies,mria_mnesia_null_storage}]}]},{frag_properties,[]},{storage_properties,[]},{cookie,{{1666086344331374145,-576460752303423202,1},
'emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'}},{version,{{8,0},{'emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local',{1666,86350,531401}}}}]         
2022-10-18T11:03:46.809511+00:00 [error] Generic server mnesia_subscr terminating. Reason: killed. Last message: {'EXIT',<0.1830.0>,killed}. State: {state,<0.1830.0>,#Ref<0.1684304479.1029832724.
194029>}.                                                                                                                                                                                          
2022-10-18T11:03:46.809864+00:00 [error] Generic server mnesia_monitor terminating. Reason: killed. Last message: {'EXIT',<0.1830.0>,killed}. State: {state,<0.1830.0>,[],[],true,[],undefined,[],[
]}.                                                                                                                                                                                                
2022-10-18T11:03:46.809510+00:00 [error] Generic server mnesia_recover terminating. Reason: killed. Last message: {'EXIT',<0.1830.0>,killed}. State: {state,<0.1830.0>,undefined,undefined,undefine
d,0,false,true,[]}.                                                                                                                                                                                
2022-10-18T11:03:46.811065+00:00 [error] crasher: initial call: mnesia_subscr:init/1, pid: <0.1832.0>, registered_name: mnesia_subscr, exit: {killed,[{gen_server,decode_msg,9,[{file,"gen_server.e
rl"},{line,481}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.1826.0>], message_queue_len: 0, messages: [], links: [], dictionar
y: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 29, reductions: 6297; neighbours:                                                                                            
2022-10-18T11:03:46.810244+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.1825.0>, registered_name: [], exit: {{normal,{mnesia_app,start,[normal,]]}},[{application_master
,init,4,[{file,"application_master.erl"},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [<0.1824.0>], message_queue_len: 1, messages: [{'EXIT',<0.1826.
0>,normal}], links: [<0.1824.0>,<0.1685.0>], dictionary: [], trap_exit: true, status: running, heap_size: 376, stack_size: 29, reductions: 167; neighbours:                                        
2022-10-18T11:03:46.811386+00:00 [error] crasher: initial call: gen_event:init_it/6, pid: <0.1828.0>, registered_name: mnesia_event, exit: {killed,[{gen_event,terminate_server,4,[{file,"gen_event
.erl"},{line,405}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [mnesia_sup,<0.1826.0>], message_queue_len: 1, messages: [{notify,{mnesia_system_event,{mnesia_do
wn,'emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'}}}], links: [], dictionary: [], trap_exit: true, status: running, heap_size: 10958, stack_size: 29, reductions: 27853; n
eighbours:                                                                                                                                                                                         
2022-10-18T11:03:46.813591+00:00 [error] crasher: initial call: mnesia_monitor:init/1, pid: <0.1831.0>, registered_name: mnesia_monitor, exit: {killed,[{gen_server,decode_msg,9,[{file,"gen_server
.erl"},{line,481}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.1826.0>], message_queue_len: 0, messages: [], links: [<53724.183
1.0>,<53725.3553.0>,<0.1873.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 13705; neighbours:                                                 
2022-10-18T11:03:46.813855+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.1816.0>, registered_name: [], exit: {{bad_return,{{mria_app,start,[normal,]]}

Answer 1 · 2022-10-19T01:29:20.000Z

Hi, @mariusstaicu has your PVC been pre-populated with historical data prior to EMQX deployment?

Answer 2 · 2022-10-19T06:08:10.000Z

It might be caused by inconsistent schemas (on different nodes) that Mnesia doesn't know how to merge.

The schema on node 'emqx@emqx-core-1.*' says it has data syncing with emqx@emqx-core-2.* and emqx@emqx-core-1.*, but the other schema on emqx@emqx-core-0.* says it has data syncing with emqx@emqx-core-2, emqx@emqx-core-1, and emqx@emqx-core-0.

It might be because you built a cluster that only had nodes core-1 and core-2, persisted the stale data/mnesia/ dir (on node core-1), and then tried to load it with 3 nodes. Please delete data/mnesia on core-1 and try again.

Answer 3 · 2022-10-19T06:55:19.000Z

The cluster had 3 nodes in the first deployment and also in the second. I do not make cluster topology changes manually; I let the operator provision everything. Is there a way to make this work without manually deleting the mnesia dir?

Answer 4 · 2022-10-19T08:04:20.000Z

> Hi, @mariusstaicu has your PVC been pre-populated with historical data prior to EMQX deployment?

No, it just contains the data from a normal EMQX deployment, with no other modifications. The steps are: deploy the EMQX resource, make some changes such as changing the dashboard password, then delete the EMQX resource and recreate it while keeping the PVCs, which is the default behaviour.

Answer 5 · 2022-10-20T02:36:25.000Z

Hi, @mariusstaicu could you please delete .spec.coreTemplate.spec.lifecycle and retry all the steps?
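
For readers following along, here is a minimal sketch of what that change might look like, assuming the manifest from the issue description is otherwise left untouched. The probes and security contexts are elided only for brevity; the single intended difference is that the lifecycle block with the preStop hook is gone.

```yaml
apiVersion: apps.emqx.io/v2alpha1
kind: EMQX
metadata:
  name: emqx
  namespace: emqx-operator-system
spec:
  image: emqx/emqx:5.0.8
  coreTemplate:
    metadata:
      name: emqx-core
      labels:
        apps.emqx.io/instance: emqx
        apps.emqx.io/db-role: core
    spec:
      replicas: 3
      volumeClaimTemplates:
        storageClassName: longhorn-single-replica
        resources:
          requests:
            storage: 512Mi
        accessModes:
          - ReadWriteOnce
      # livenessProbe, readinessProbe, podSecurityContext and
      # containerSecurityContext: keep as in the original manifest.
      # The lifecycle.preStop hook ("emqx ctl cluster leave") is
      # intentionally omitted.
```

One plausible reading, not confirmed in this thread, is that the preStop hook makes each core node leave the cluster while the pods are being torn down, so the Mnesia data kept on each PVC records a different cluster membership and cannot be merged when the nodes come back.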

Answer 6 · 2022-10-27T06:07:21.000Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Answer 7 · 2022-10-27T14:47:35.000Z

Hi @Rory-Z I will try.

Answer 8 · 2022-10-28T08:51:55.000Z

Hi I can confirm that the problem doesn't happen anymore when I remove the lifecycle.

Answer 9 · 2022-10-28T08:59:55.000Z

> Hi I can confirm that the problem doesn't happen anymore when I remove the lifecycle.

Cool.
I will close this issue; if you have any questions, you can always reopen it.