uselagoon/lagoon

Broker pods erroring with "Waiting for Mnesia tables"

achton opened this issue · 2 comments

achton commented

Describe the bug

In our hosted Lagoon cluster (v2.16.0 on AKS 1.25.15), we are seeing issues with the lagoon-core-broker StatefulSet. From time to time, the pods continuously log errors like the ones below. When I arrive at the scene, there is usually only one pod alive (lagoon-core-broker-0), and it is not running.

2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> Feature flag `quorum_queue`: migration function crashed: {error,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                                                           {timeout_waiting_for_tables,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                                                            ['rabbit@lagoon-core-broker-2.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                                                             'rabbit@lagoon-core-broker-1.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                                                             'rabbit@lagoon-core-broker-0.lagoon-core-broker-headless.lagoon-core.svc.cluster.local'],
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                                                            [rabbit_durable_queue]}}
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0> [{rabbit_table,wait,3,[{file,"rabbit_table.erl"},{line,121}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>  {rabbit_core_ff,quorum_queue_migration,3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                  [{file,"rabbit_core_ff.erl"},{line,88}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>  {rabbit_feature_flags,run_migration_fun,3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                        [{file,"rabbit_feature_flags.erl"},{line,1602}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>  {rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                        3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                        [{file,"rabbit_feature_flags.erl"},{line,2269}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>  {maps,fold_1,3,[{file,"maps.erl"},{line,410}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>  {rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                        [{file,"rabbit_feature_flags.erl"},{line,2267}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>  {rabbit_feature_flags,sync_feature_flags_with_cluster,3,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                        [{file,"rabbit_feature_flags.erl"},{line,2082}]},
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>  {rabbit_mnesia,ensure_feature_flags_are_in_sync,2,
2023-11-30 11:08:02.203766+00:00 [error] <0.221.0>                 [{file,"rabbit_mnesia.erl"},{line,645}]}]
2023-11-30 11:08:02.204462+00:00 [info] <0.221.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                         ['rabbit@lagoon-core-broker-2.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                          'rabbit@lagoon-core-broker-1.lagoon-core-broker-headless.lagoon-core.svc.cluster.local',
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                          'rabbit@lagoon-core-broker-0.lagoon-core-broker-headless.lagoon-core.svc.cluster.local'],
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                         [rabbit_user,rabbit_user_permission,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                          rabbit_topic_permission,rabbit_vhost,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                          rabbit_durable_route,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                          rabbit_durable_exchange,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                          rabbit_runtime_parameters,
2023-11-30 11:08:32.206287+00:00 [warning] <0.221.0>                                          rabbit_durable_queue]}
2023-11-30 11:08:32.206543+00:00 [info] <0.221.0> Waiting for Mnesia tables for 30000 ms, 8 retries left

I am able to recover from this state by exec'ing into the pod, stopping RabbitMQ and restarting it with the force_boot setting enabled:

# Stop RabbitMQ
/ $ rabbitmqctl stop_app
Stopping rabbit application on node rabbit@lagoon-core-broker-0.lagoon-core-broker-headless.lagoon-core.svc.cluster.local ...
[...]

# Start RabbitMQ with force_boot enabled.
/ $ rabbitmqctl force_boot
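
For anyone who wants to run the same recovery without an interactive shell in the pod, here is a minimal sketch that wraps the steps in kubectl exec. It rests on several assumptions: the namespace (lagoon-core) and pod name are taken from the hostnames in the log above, the helper names run_in_pod and force_boot_broker are made up for illustration, and the final rabbitmqctl start_app step is not shown in the session above but is my understanding of how the app is brought back up after force_boot. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper, not part of Lagoon. Defaults are assumptions taken from
# the hostnames in the log output; override them via the environment if needed.
NAMESPACE="${NAMESPACE:-lagoon-core}"
POD="${POD:-lagoon-core-broker-0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Run (or, in dry-run mode, print) a command inside the broker pod.
run_in_pod() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "kubectl exec -n $NAMESPACE $POD -- $*"
  else
    kubectl exec -n "$NAMESPACE" "$POD" -- "$@"
  fi
}

# Stop the RabbitMQ app, mark the node to boot unconditionally, start it again.
# The start_app step is an assumption; the session above ends at force_boot.
force_boot_broker() {
  run_in_pod rabbitmqctl stop_app &&
  run_in_pod rabbitmqctl force_boot &&
  run_in_pod rabbitmqctl start_app
}
```

After sourcing the snippet, calling force_boot_broker prints the three kubectl commands. Setting DRY_RUN=0 first makes it execute them against the cluster in the current kubectl context.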

To Reproduce

I am unsure how to get the cluster into this state. The RabbitMQ docs say this about shutdown:

Normally when you shut down a RabbitMQ cluster altogether, the first node you restart should be the last one to go down, since it may have seen things happen that other nodes did not. But sometimes that's not possible: for instance if the entire cluster loses power then all nodes may think they were not the last to shut down.

So maybe this is due to pods being evicted during automated node shutdown and not starting in the correct order.

Expected behavior

I expected the broker pods to restart correctly when they are evicted or shut down automatically for any reason.

achton commented

Related discussion: helm/charts#13485

achton commented

This may very well be addressed by the changes in #3586, which also have to do with clustering. They are included in Lagoon v2.17.0.

We've rolled it out to the cluster which has had this problem twice now.