aweber/rabbitmq-autocluster

Cannot Add new nodes in AWS AutoScaling Group

spember opened this issue · 7 comments

I'm not sure exactly what's happening, but nodes fail to join the cluster about 50% of the time, and I can't find anything meaningful in the logs.
Scenario:

  • launch AMI with rabbitmq w/autocluster in AutoScaling group. Instances have permissions to describe instances in the autoscaling group. The first instance comes up without a problem, launching a cluster of 1 node
  • increase autoscaling group to have a min of more than 1. Instances are launched. About half the time, the node will fail to start rabbitmq.
  • If failure, I can terminate the instance. A new one will appear and will connect just fine.
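
For reference, the "permissions to describe instances" in the first step could be granted by an IAM policy along the lines of the sketch below. The two action names are an assumption about what the AWS backend needs (the thread does not list them), and `describe_only_policy` is a hypothetical helper name; compare against the plugin's README before relying on it.

```python
import json


def describe_only_policy() -> dict:
    """Build a read-only IAM policy document for instance discovery.

    Assumption: the autocluster AWS backend only needs to describe
    Auto Scaling instances and EC2 instances. Neither action is named
    in this thread.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "autoscaling:DescribeAutoScalingInstances",
                    "ec2:DescribeInstances",
                ],
                # Describe* calls are not scoped to a single resource here.
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be attached to the instance role.
    print(json.dumps(describe_only_policy(), indent=2))
```

If the policy attached to the instance role is missing either action, that would be one thing to rule out, although it would not by itself explain a failure that happens only about half the time.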

There appears to be no pattern to which nodes fail, e.g. no difference in availability zone.

The logs on failed nodes look like this:

=INFO REPORT==== 16-Sep-2016::14:58:35 ===
Starting RabbitMQ 3.6.5 on Erlang 19.0.3
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 16-Sep-2016::14:58:35 ===
node           : rabbit@<ipaddress>
home dir       : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash    : znKyrXfwLkBEHmOK3h9zxA==
log            : /var/log/rabbitmq/rabbit@<ipaddress>.log
sasl log       : /var/log/rabbitmq/rabbit@<ipaddress>-sasl.log
database dir   : /var/lib/rabbitmq/mnesia/rabbit@<ipaddress>

=INFO REPORT==== 16-Sep-2016::14:58:36 ===
autocluster: Delaying startup for 3270ms.

=INFO REPORT==== 16-Sep-2016::14:58:40 ===
autocluster: Starting aws registration.

=INFO REPORT==== 16-Sep-2016::14:58:40 ===
Error description:
   {could_not_start,rabbit,
       {function_clause,
           [{autocluster,maybe_register,
                [error,aws,autocluster_aws],
                [{file,"src/autocluster.erl"},{line,111}]},
            {autocluster,init,0,[{file,"src/autocluster.erl"},{line,33}]},
            {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
                [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
            {rabbit_boot_steps,run_step,2,
                [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
            {rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,
                [{file,"src/rabbit_boot_steps.erl"},{line,26}]},
            {rabbit_boot_steps,run_boot_steps,1,
                [{file,"src/rabbit_boot_steps.erl"},{line,26}]},
            {rabbit,start,2,[{file,"src/rabbit.erl"},{line,583}]},
            {application_master,start_it_old,4,
                [{file,"application_master.erl"},{line,273}]}]}}

Log files (may contain more information):
<this points to the current file>

Is there a step I'm missing? More importantly, could we get some more meaningful information about the error in the logs?

autocluster:maybe_register/3 failed, but there's little detail about what's going on. Please use correct GitHub formatting; perhaps the missing formatting swallows some of the log?

Whoops. Is that better, @michaelklishin? Note that this is the entire file.

That at least contains a line in autocluster.erl, thank you.
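
A note on reading the report: in an Erlang `function_clause` error, the top stack frame carries the actual argument list instead of an arity, so the trace shows `autocluster:maybe_register` being called with `error`, `aws`, `autocluster_aws` at `src/autocluster.erl` line 111. Presumably an earlier step returned `error` and no clause of `maybe_register/3` accepts it. The sketch below pulls that frame out of a pasted report; `failing_call` and the regular expression are illustrative helpers, not part of RabbitMQ or the plugin, and the simple comma split only handles flat argument lists like the one here.

```python
import re

# Matches a stack frame whose third element is an argument list (as printed
# for the failing call in a function_clause error) rather than an arity.
FRAME_WITH_ARGS = re.compile(
    r'\{(\w+),\s*(\w+),\s*\[([^\]]*)\],\s*'
    r'\[\{file,"([^"]+)"\},\s*\{line,(\d+)\}\]\}'
)


def failing_call(report: str):
    """Return the first frame that carries arguments, or None if absent."""
    match = FRAME_WITH_ARGS.search(report)
    if match is None:
        return None
    module, function, raw_args, source_file, line = match.groups()
    # Naive split: fine for flat atoms, not for nested tuples or lists.
    args = [arg.strip() for arg in raw_args.split(",") if arg.strip()]
    return {
        "module": module,
        "function": function,
        "args": args,
        "file": source_file,
        "line": int(line),
    }


# Excerpt copied from the error report earlier in this thread.
REPORT = '''
   {could_not_start,rabbit,
       {function_clause,
           [{autocluster,maybe_register,
                [error,aws,autocluster_aws],
                [{file,"src/autocluster.erl"},{line,111}]},
            {autocluster,init,0,[{file,"src/autocluster.erl"},{line,33}]},
'''

if __name__ == "__main__":
    print(failing_call(REPORT))
```

The same pattern also matches the crash report format, where `{line,111}` is wrapped onto its own line.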

I'm running into what appears to be this same problem. I also have a crash report with a little more information. In contrast to @spember, however, I haven't been able to get the autocluster plugin to work at all.

=CRASH REPORT==== 27-Sep-2016::10:23:59 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.155.0>
    registered_name: []
    exception exit: {bad_return,
                        {{rabbit,start,[normal,[]]},
                         {'EXIT',
                             {function_clause,
                                 [{autocluster,maybe_register,
                                      [error,aws,autocluster_aws],
                                      [{file,"src/autocluster.erl"},
                                       {line,111}]},
                                  {autocluster,init,0,
                                      [{file,"src/autocluster.erl"},
                                       {line,33}]},
                                  {rabbit_boot_steps,
                                      '-run_step/2-lc$^1/1-1-',1,
                                      [{file,"src/rabbit_boot_steps.erl"},
                                       {line,49}]},
                                  {rabbit_boot_steps,run_step,2,
                                      [{file,"src/rabbit_boot_steps.erl"},
                                       {line,49}]},
                                  {rabbit_boot_steps,
                                      '-run_boot_steps/1-lc$^0/1-0-',1,
                                      [{file,"src/rabbit_boot_steps.erl"},
                                       {line,26}]},
                                  {rabbit_boot_steps,run_boot_steps,1,
                                      [{file,"src/rabbit_boot_steps.erl"},
                                       {line,26}]},
                                  {rabbit,start,2,
                                      [{file,"src/rabbit.erl"},{line,583}]},
                                  {application_master,start_it_old,4,
                                      [{file,"application_master.erl"},
                                       {line,273}]}]}}}}
      in function  application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.154.0>]
    messages: [{'EXIT',<0.156.0>,normal}]
    links: [<0.154.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 255
  neighbours:

Looks very similar to the issue I've just raised, #104.

#104 has a few comments that outline what seems to be going on. I'd close it in favour of that issue.

gmr commented

This plugin was forked by the RabbitMQ team and is now part of RabbitMQ. More information can be found at https://github.com/rabbitmq/rabbitmq-autocluster