RabbitMQ is not able to recover/start when it is not gracefully shut down
Closed this issue · 4 comments
Describe the bug
RabbitMQ is not able to start after it was not shut down gracefully, that is, after it was killed for some reason (e.g. OOMKilled). When this plugin is disabled, RabbitMQ is able to start, though.
Here are the logs that we are getting:
Running boot step database_sync defined by app rabbit
2024-02-05 13:11:47.492139+00:00 [info] <0.221.0> Running boot step feature_flags defined by app rabbit
2024-02-05 13:11:47.492470+00:00 [info] <0.221.0> Running boot step codec_correctness_check defined by app rabbit
2024-02-05 13:11:47.492535+00:00 [info] <0.221.0> Running boot step external_infrastructure defined by app rabbit
2024-02-05 13:11:47.492580+00:00 [info] <0.221.0> Running boot step rabbit_delayed_message defined by app rabbitmq_delayed_message_exchange
2024-02-05 13:11:47.492914+00:00 [info] <0.221.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> Error in process <0.271.0> on node 'rabbit@rabbitmq-inbox-0' with exit value:
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> {badarg,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> [{ets,insert,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> ['rabbit_delayed_messagerabbit@rabbitmq-inbox-0',
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> [{delay_entry,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> {delay_key,1707152400070,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> {exchange,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> {resource,<<"/">>,exchange,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> <<"direct-delay-agent-exchange">>},
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> 'x-delayed-message',true,false,false,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> [{<<"x-delayed-type">>,longstr,<<"direct">>}],
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> undefined,undefined,undefined,
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> {[],[]},
2024-02-05 13:12:17.506034+00:00 [error] <0.271.0> #{user => <<"guest">>}}},
Erlang version: 24.3.4.2
RabbitMQ version: 3.10.6
Reproduction steps
- Make RabbitMQ crash with SIGKILL
- Start it again using rabbitmq-server
- The node should crash again during boot
Expected behavior
RabbitMQ should be able to recover after a crash, as it does without this plugin installed.
Additional context
No response
RabbitMQ 3.10 has reached end of life.
This plugin is very unlikely to receive any attention from the core team outside of #253 (#229).
But one thing that immediately stands out from this exception is
2024-02-05 13:11:47.492914+00:00 [info] <0.221.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
which makes me wonder whether this node had actually booted by the time the plugin tried to use its schema tables. If not, then this plugin has very few options as to what it could do to avoid this exception.
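As a rough cross-check of the timing only, the two timestamps from the log excerpt can be compared against the 30000 ms wait. This is a minimal Python sketch: the constant and function names are made up for illustration, and a gap of roughly the timeout length is merely consistent with the table wait expiring; it does not establish the cause.

```python
from datetime import datetime

# Timestamps copied verbatim from the log excerpt in this issue.
WAIT_STARTED = "2024-02-05 13:11:47.492914+00:00"
ERROR_LOGGED = "2024-02-05 13:12:17.506034+00:00"

# Taken from the message "Waiting for Mnesia tables for 30000 ms".
WAIT_TIMEOUT_MS = 30000


def elapsed_ms(start: str, end: str) -> float:
    """Return the time between two log timestamps in milliseconds."""
    fmt = "%Y-%m-%d %H:%M:%S.%f%z"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() * 1000


gap_ms = elapsed_ms(WAIT_STARTED, ERROR_LOGGED)
print(f"gap between the wait message and the error: {gap_ms:.2f} ms")
print("error was logged after the wait period elapsed:", gap_ms >= WAIT_TIMEOUT_MS)
```

The error appears about 13 ms after the 30000 ms window closes, which fits the question above but says nothing about why the tables were not available.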
It seems the node has not booted, and since we are running it in standalone mode, it wouldn't be able to contact other nodes either?
Is this problem solved in RabbitMQ v3.12?
I'm afraid I don't know what this "standalone mode" is. Nodes contact their peers fairly early in the process, before plugins are enabled:
We do not guess in this community, so that's as much as I can say from a few log messages.
@lokesh411 we don't understand what the problem is, so I'm not going to tell you if it's been solved or not. We do not guess in this community. 3.12.x is the only version with active community support.
Nothing in 3.12 has changed around how nodes form clusters or contact their peers on restart. Definitely nothing fundamental. The only change related to plugin activation that I recall has moved this step to the latest possible moment, right before definition import (the final step). If anything, this makes it less likely that a plugin that declares its own tables, like this one, would try to do so before all tables were synced.
Anyhow, these are just guesses after guesses with a few log lines.