EMQX core node in error
mariusstaicu opened this issue · 9 comments
Describe the bug
An EMQX core node fails to start on k8s after the EMQX resource is deleted and recreated with a persistent volume. The goal was to deploy a broker that preserves its settings (dashboard users, MQTT client auth & authz) even if the EMQX resource is deleted.
To Reproduce
1. Install emqx-operator via Helm in k8s
2. Deploy a basic EMQX broker with a volumeClaimTemplate
3. Wait for it to start, then change the dashboard login and password
4. Delete the EMQX resource and recreate it
One of the three core nodes doesn't start and stays in an error state.
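For reference, that cycle looks roughly like the following on the CLI (a sketch only; the Helm repo and chart names follow the EMQX operator docs, the release name is an assumption, and the manifest further below is assumed to be saved as emqx.yaml):

# 1. Install the operator (cert-manager prerequisite omitted)
helm repo add emqx https://repos.emqx.io/charts
helm repo update
helm install emqx-operator emqx/emqx-operator --namespace emqx-operator-system --create-namespace
# 2. Deploy the broker
kubectl apply -f emqx.yaml
# 3. Wait for the pods to become ready, then change the dashboard login and password via the UI
# 4. Delete the EMQX resource and recreate it; the PVCs created from volumeClaimTemplates are kept by default
kubectl delete -f emqx.yaml
kubectl apply -f emqx.yaml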
Expected behavior
All the pods start, both core and replicant nodes.
Anything else we need to know?:
emqx yaml file:
apiVersion: apps.emqx.io/v2alpha1
kind: EMQX
metadata:
  name: emqx
  namespace: emqx-operator-system
spec:
  image: emqx/emqx:5.0.8
  coreTemplate:
    metadata:
      name: emqx-core
      labels:
        apps.emqx.io/instance: emqx
        apps.emqx.io/db-role: core
    spec:
      replicas: 3
      volumeClaimTemplates:
        storageClassName: longhorn-single-replica
        resources:
          requests:
            storage: 512Mi
        accessModes:
          - ReadWriteOnce
      livenessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 60
        periodSeconds: 30
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /status
          port: 18083
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 12
      podSecurityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        fsGroupChangePolicy: Always
      containerSecurityContext:
        runAsUser: 1000
        runAsGroup: 1000
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "emqx ctl cluster leave"]
Environment details:
- Kubernetes version: v1.21.10
- Cloud-provider/provisioner: hosted
- emqx-operator version: 2.0.1
- Install method: helm
Logs:
dashboard.bootstrap_users_file = EMQX_DASHBOARD__BOOTSTRAP_USERS_FILE = "/opt/emqx/data/bootstrap_user"
rpc.port_discovery = EMQX_RPC__PORT_DISCOVERY = manual
log.file_handlers.default.enable = EMQX_LOG__FILE_HANDLERS__DEFAULT__ENABLE = false
log.console_handler.enable = EMQX_LOG__CONSOLE_HANDLER__ENABLE = true
node.db_role = EMQX_NODE__DB_ROLE = core
node.name = EMQX_NODE__NAME = emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local
2022-10-18T11:03:36.662853+00:00 [error] Mnesia('emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'emqx@emqx-core-1.emqx-headless.emqx-operator-system.svc.cluster.local'}
2022-10-18T11:03:36.663234+00:00 [error] Mnesia('emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local'}
2022-10-18T11:03:36.854583+00:00 [error] Mnesia('emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'): ** ERROR ** (core dumped to file: "/opt/emqx/MnesiaCore.emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local_1666_91016_807092"), ** FATAL ** Failed to merge schema: Incompatible schema cookies. Please, restart from old backup.'emqx@emqx-core-1.emqx-headless.emqx-operator-system.svc.cluster.local' = [{name,schema},{type,set},{ram_copies,[]},{disc_copies,['emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local','emqx@emqx-core-1.emqx-headless.emqx-operator-system.svc.cluster.local']},{disc_only_copies,[]},{load_order,0},{access_mode,read_write},{majority,false},{index,[]},{snmp,[]},{local_content,false},{record_name,schema},{attributes,[table,cstruct]},{user_properties,[{mnesia_backend_types,[{rocksdb_copies,mnesia_rocksdb},{null_copies,mria_mnesia_null_storage}]}]},{frag_properties,[]},{storage_properties,[]},{cookie,{{1666087950880376159,-576460752303423101,1},'emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local'}},{version,{{7,0},{'emqx@emqx-core-1.emqx-headless.emqx-operator-system.svc.cluster.local',{1666,88190,221401}}}}], 'emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local' = [{name,schema},{type,set},{ram_copies,[]},{disc_copies,['emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local','emqx@emqx-core-1.emqx-headless.emqx-operator-system.svc.cluster.local','emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local']},{disc_only_copies,[]},{load_order,0},{access_mode,read_write},{majority,false},{index,[]},{snmp,[]},{local_content,false},{record_name,schema},{attributes,[table,cstruct]},{user_properties,[{mnesia_backend_types,[{rocksdb_copies,mnesia_rocksdb},{null_copies,mria_mnesia_null_storage}]}]},{frag_properties,[]},{storage_properties,[]},{cookie,{{1666086344331374145,-576460752303423202,1},'emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'}},{version,{{8,0},{'emqx@emqx-core-2.emqx-headless.emqx-operator-system.svc.cluster.local',{1666,86350,531401}}}}]
2022-10-18T11:03:46.809511+00:00 [error] Generic server mnesia_subscr terminating. Reason: killed. Last message: {'EXIT',<0.1830.0>,killed}. State: {state,<0.1830.0>,#Ref<0.1684304479.1029832724.194029>}.
2022-10-18T11:03:46.809864+00:00 [error] Generic server mnesia_monitor terminating. Reason: killed. Last message: {'EXIT',<0.1830.0>,killed}. State: {state,<0.1830.0>,[],[],true,[],undefined,[],[]}.
2022-10-18T11:03:46.809510+00:00 [error] Generic server mnesia_recover terminating. Reason: killed. Last message: {'EXIT',<0.1830.0>,killed}. State: {state,<0.1830.0>,undefined,undefined,undefined,0,false,true,[]}.
2022-10-18T11:03:46.811065+00:00 [error] crasher: initial call: mnesia_subscr:init/1, pid: <0.1832.0>, registered_name: mnesia_subscr, exit: {killed,[{gen_server,decode_msg,9,[{file,"gen_server.erl"},{line,481}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.1826.0>], message_queue_len: 0, messages: [], links: [], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 29, reductions: 6297; neighbours:
2022-10-18T11:03:46.810244+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.1825.0>, registered_name: [], exit: {{normal,{mnesia_app,start,[normal,]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,142}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [<0.1824.0>], message_queue_len: 1, messages: [{'EXIT',<0.1826.0>,normal}], links: [<0.1824.0>,<0.1685.0>], dictionary: [], trap_exit: true, status: running, heap_size: 376, stack_size: 29, reductions: 167; neighbours:
2022-10-18T11:03:46.811386+00:00 [error] crasher: initial call: gen_event:init_it/6, pid: <0.1828.0>, registered_name: mnesia_event, exit: {killed,[{gen_event,terminate_server,4,[{file,"gen_event.erl"},{line,405}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [mnesia_sup,<0.1826.0>], message_queue_len: 1, messages: [{notify,{mnesia_system_event,{mnesia_down,'emqx@emqx-core-0.emqx-headless.emqx-operator-system.svc.cluster.local'}}}], links: [], dictionary: [], trap_exit: true, status: running, heap_size: 10958, stack_size: 29, reductions: 27853; neighbours:
2022-10-18T11:03:46.813591+00:00 [error] crasher: initial call: mnesia_monitor:init/1, pid: <0.1831.0>, registered_name: mnesia_monitor, exit: {killed,[{gen_server,decode_msg,9,[{file,"gen_server.erl"},{line,481}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [mnesia_kernel_sup,mnesia_sup,<0.1826.0>], message_queue_len: 0, messages: [], links: [<53724.1831.0>,<53725.3553.0>,<0.1873.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 13705; neighbours:
2022-10-18T11:03:46.813855+00:00 [error] crasher: initial call: application_master:init/4, pid: <0.1816.0>, registered_name: [], exit: {{bad_return,{{mria_app,start,[normal,]]}
Hi @mariusstaicu, has your PVC been pre-populated with historical data prior to the EMQX deployment?
It might be caused by inconsistent schemas (on different nodes) that Mnesia doesn't know how to merge.
The schema on node 'emqx@emqx-core-1.*' says it syncs data with emqx@emqx-core-2.* and emqx@emqx-core-1.*, but the other schema, on emqx@emqx-core-0.*, says it syncs data with emqx@emqx-core-2, emqx@emqx-core-1, and emqx@emqx-core-0.
It might be because you previously built a cluster that only had the nodes core-1 and core-2, persisted that stale data/mnesia/ dir (on node core-1), and are now trying to load it with 3 nodes. Please delete data/mnesia on core-1 and try again.
The cluster has 3 nodes in both the first and the second deployment. I do not make cluster topology changes manually; I let the operator provision everything. Is there a way to make this work without manually deleting the mnesia dir?
Hi @mariusstaicu, has your PVC been pre-populated with historical data prior to the EMQX deployment?
No, it just contains the data from a normal EMQX deployment; no other modifications were made. So: deploy the EMQX resource, make some changes such as changing the dashboard password, then delete the EMQX resource and recreate it while keeping the PVCs, which is the default behaviour.
Hi @mariusstaicu, could you please delete .spec.coreTemplate.spec.lifecycle and retry all the steps?
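For example, the field can be removed from the live resource with a JSON patch (a sketch; it assumes the EMQX resource name emqx and the namespace emqx-operator-system from the manifest above), or by deleting the lifecycle block from emqx.yaml and re-applying it:

# remove .spec.coreTemplate.spec.lifecycle from the running EMQX custom resource
kubectl -n emqx-operator-system patch emqx emqx --type=json \
  -p '[{"op": "remove", "path": "/spec/coreTemplate/spec/lifecycle"}]'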
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @Rory-Z I will try.
Hi, I can confirm that the problem doesn't happen anymore when I remove the lifecycle.
Hi, I can confirm that the problem doesn't happen anymore when I remove the lifecycle.
Cool.
I will close this issue; if you have any questions, you can always reopen it.