Broker pods erroring with "Waiting for Mnesia tables"
achton opened this issue · 2 comments
Describe the bug
In our hosted Lagoon cluster (v2.16.0 on AKS 1.25.15), we are seeing issues with the lagoon-core-broker StatefulSet. From time to time, the pods error out continuously with messages like the ones below. By the time I get to the scene, there is usually only one pod left (lagoon-core-broker-0), and it is not running.
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> Feature flag `quorum_queue`: migration function crashed: {error,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {timeout_waiting_for_tables,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> ['rabbit@lagoon-core-broker-2.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> 'rabbit@lagoon-core-broker-1.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> 'rabbit@lagoon-core-broker-0.lagoon-core-broker-headless.lagoon-core.svc.cluster.local'],
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [rabbit_durable_queue]}}
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{rabbit_table,wait,3,[{file,"rabbit_table.erl"},{line,121}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {rabbit_core_ff,quorum_queue_migration,3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{file,"rabbit_core_ff.erl"},{line,88}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {rabbit_feature_flags,run_migration_fun,3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{file,"rabbit_feature_flags.erl"},{line,1602}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> 3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{file,"rabbit_feature_flags.erl"},{line,2269}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {maps,fold_1,3,[{file,"maps.erl"},{line,410}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{file,"rabbit_feature_flags.erl"},{line,2267}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {rabbit_feature_flags,sync_feature_flags_with_cluster,3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{file,"rabbit_feature_flags.erl"},{line,2082}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> {rabbit_mnesia,ensure_feature_flags_are_in_sync,2,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{file,"rabbit_mnesia.erl"},{line,645}]}]
2023-11-30 11:08:02.204462+00:00 [info] <0.221.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> ['rabbit@lagoon-core-broker-2.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> 'rabbit@lagoon-core-broker-1.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> 'rabbit@lagoon-core-broker-0.lagoon-core-broker-headless.lagoon-core.svc.cluster.local'],
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> [rabbit_user,rabbit_user_permission,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> rabbit_topic_permission,rabbit_vhost,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> rabbit_durable_route,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> rabbit_durable_exchange,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> rabbit_runtime_parameters,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> rabbit_durable_queue]}
2023-11-30 11:08:32.206543+00:00 [info] <0.221.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
I am able to recover from this state by exec'ing into the pod, stopping the RabbitMQ app, and starting it again with force_boot enabled:
# Stop the RabbitMQ app
/ $ rabbitmqctl stop_app
Stopping rabbit application on node rabbit@lagoon-core-broker-0.lagoon-core-broker-headless.lagoon-core.svc.cluster.local ...
[...]
# Enable force_boot so the node boots without waiting for the other cluster members.
/ $ rabbitmqctl force_boot
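For reference, the same recovery can also be run from outside the pod. This is only a sketch, assuming the lagoon-core namespace and the default container in the broker pod; the final start_app is what actually boots the app once the force_boot marker is set:

# Hypothetical one-shot recovery from a workstation with kubectl access
kubectl -n lagoon-core exec lagoon-core-broker-0 -- rabbitmqctl stop_app
kubectl -n lagoon-core exec lagoon-core-broker-0 -- rabbitmqctl force_boot
# Boot the app again; with force_boot set, it starts even though it was not the last node down
kubectl -n lagoon-core exec lagoon-core-broker-0 -- rabbitmqctl start_app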
To Reproduce
I am unsure how to get the cluster into this state. The RabbitMQ docs say the following about shutting down a cluster:
Normally when you shut down a RabbitMQ cluster altogether, the first node you restart should be the last one to go down, since it may have seen things happen that other nodes did not. But sometimes that's not possible: for instance if the entire cluster loses power then all nodes may think they were not the last to shut down.
So maybe this is caused by pods being evicted during automated node shutdowns and then not being restarted in the correct order.
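If it is an ordering problem, the StatefulSet's pod management policy seems worth checking. A rough sketch, assuming the StatefulSet is named lagoon-core-broker in the lagoon-core namespace: with the default OrderedReady policy, broker-0 must become Ready before broker-1 and broker-2 are started, so if broker-0 was not the last node to shut down it can sit waiting for Mnesia tables from peers that are never scheduled.

# Empty output means the field is unset, i.e. the default OrderedReady policy
kubectl -n lagoon-core get statefulset lagoon-core-broker \
  -o jsonpath='{.spec.podManagementPolicy}{"\n"}'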
Expected behavior
I expected the broker pods to restart correctly when they are evicted or shut down automatically for any reason.
Related discussion: helm/charts#13485
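Until the restart ordering itself is addressed, one possible stop-gap is to give the boot a longer window to find its peers. The sketch below uses standard rabbitmq.conf keys (their defaults match the "30000 ms, N retries left" lines in the log above); whether and how the lagoon-core chart lets you pass extra broker configuration is an assumption on my part, and this only helps if the other pods do eventually come up within the window:

# rabbitmq.conf sketch; upstream defaults are 30000 ms and 10 retries
mnesia_table_loading_retry_timeout = 30000
mnesia_table_loading_retry_limit = 20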