aweber/rabbitmq-autocluster

k8s issue

Stono opened this issue · 33 comments

Stono commented

@xek - have you seen this before?

autocluster: Could not fetch node list from autocluster_k8s: {failed_connect,
                                                              [{to_address,
                                                                {"kubernetes.default.svc.cluster.local",
                                                                 443}},
                                                               {inet,
                                                                [inet],
                                                                ssl_not_started}]}.

curl works.

@Stono a Kubernetes client failed to connect to "kubernetes.default.svc.cluster.local" with TLS enabled because the ssl app was not started.

Looks like the Kubernetes backend needs to make sure all TLS-related apps are started first.

Stono commented

@michaelklishin I'm not sure I follow you?

@Stono in Erlang many standard library parts are applications (microservices if you will). The one that's responsible for the TLS implementation is not running at the time a connection to Kubernetes endpoint was attempted, so it failed with failed_connect.

Stono commented

@michaelklishin ok. any suggestions on how to fix this?

@Stono start with posting the entire log. What you posted doesn't contain a single module or function name, leave alone line numbers.

in #89 the ssl app was added as a dependency, so the maintainers of this will need a way to reproduce, a single line error message is not enough.

Stono commented
$ ki logs rabbitmq-3096751937-8z93r
Using default rabbitmq configuration as no config found in /storage/rabbitmq.config
Using default rabbitmq definitions as no config found in /storage/rabbitmq_definitions.json
Preboot complete, starting rabbitmq...
Using ip address as hostname
Hostname: 10.20.1.6, RabbitMQ Nodename: rabbit@10.20.1.6
[INFO] RabbitMQ Started (PID: 17, ServerPid: 17)!

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /storage/log/rabbit@10.20.1.6.log
  ######  ##        /storage/log/rabbit@10.20.1.6-sasl.log
  ##########
              Starting broker...

=INFO REPORT==== 7-Sep-2016::11:14:50 ===
Starting RabbitMQ 3.6.5 on Erlang 19.0
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 7-Sep-2016::11:14:50 ===
node           : rabbit@10.20.1.6
home dir       : /home/rabbitmq
config file(s) : /storage/rabbitmq.config
cookie hash    : sjAQ44vvJUb2ijPGD0IntA==
log            : /storage/log/rabbit@10.20.1.6.log
sasl log       : /storage/log/rabbit@10.20.1.6-sasl.log
database dir   : /storage/mnesia/rabbit@10.20.1.6

=INFO REPORT==== 7-Sep-2016::11:14:57 ===
autocluster: Using k8s backend

=INFO REPORT==== 7-Sep-2016::11:14:57 ===
autocluster: Delaying startup for 3699ms.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
autocluster: Starting k8s registration.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/td-integration/endpoints/rabbitmq

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
autocluster: Response: [{error,
                            {failed_connect,
                                [{to_address,
                                     {"kubernetes.default.svc.cluster.local",
                                      443}},
                                 {inet,[inet],ssl_not_started}]}}]

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
autocluster: HTTP Error {failed_connect,
                            [{to_address,
                                 {"kubernetes.default.svc.cluster.local",443}},
                             {inet,[inet],ssl_not_started}]}

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
ERROR: "autocluster: Failed to get nodes from k8s - ~s~n" - [{failed_connect,
                                                              [{to_address,
                                                                {"kubernetes.default.svc.cluster.local",
                                                                 443}},
                                                               {inet,
                                                                [inet],
                                                                ssl_not_started}]}]

=ERROR REPORT==== 7-Sep-2016::11:15:01 ===
autocluster: Could not fetch node list from autocluster_k8s: {failed_connect,
                                                              [{to_address,
                                                                {"kubernetes.default.svc.cluster.local",
                                                                 443}},
                                                               {inet,
                                                                [inet],
                                                                ssl_not_started}]}.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Memory limit set to 2999MB of 7499MB total.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Disk free limit set to 50MB

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Database directory at /storage/mnesia/rabbit@10.20.1.6 is empty. Initialising from scratch...

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 7-Sep-2016::11:15:01 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Applying definitions from: /storage/rabbitmq_definitions.json

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Creating user 'guest'

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Changing password for 'guest'

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Setting user tags for user 'guest' to [administrator]

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Adding vhost '/'

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Setting permissions for 'guest' in '/' to '.*', '.*', '.*'

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Management plugin started. Port: 15672

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics event collector started.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics channel stats collector started.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics queue stats collector started.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics database started.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_queue_stats_fine_stats with interval 5000.

=ERROR REPORT==== 7-Sep-2016::11:15:01 ===
Failed to retrieve AWS credentials: undefined

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
autocluster: (cleanup) Timer started {60,false}

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_queue_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_queue_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_queue_stats_queue_msg_rates with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_queue_stats_process_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_queue_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_vhost_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_vhost_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_vhost_stats_queue_msg_rates with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_vhost_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_vhost_stats_coarse_conn_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_queue_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_queue_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_queue_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_stats_deliver_get with interval 5000.
 completed with 12 plugins.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Server startup complete; 12 plugins started.
 * autocluster
 * rabbitmq_aws
 * rabbitmq_shovel_management
 * rabbitmq_management_visualiser
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * webmachine
 * mochiweb
 * rabbitmq_shovel
 * amqp_client
 * rabbitmq_consistent_hash_exchange
 * rabbitmq_management_agent

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_stats_process_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_exchange_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_channel_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_node_stats_coarse_node_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_node_node_stats_coarse_node_node_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_connection_stats_coarse_conn_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table aggr_connection_stats_process_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table channel_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:15:01 ===
Statistics garbage collector started for table connection_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::11:16:01 ===
autocluster: (cleanup) checking cluster

=INFO REPORT==== 7-Sep-2016::11:16:01 ===
autocluster: (cleanup) Checking for partitioned nodes.

=INFO REPORT==== 7-Sep-2016::11:16:01 ===
autocluster: (cleanup) No partitioned nodes found.

xek commented

application:ensure_all_started(ssl). should work, but also
application:ensure_all_started(autocluster). should start all required apps and ssl is one of them

I'm not sure why ssl is not started in your boot process, maybe it has something to do with this: http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2014-May/036082.html

Stono commented

@xek if I go onto the box and do:

 [root@rabbitmq-3440553795-05v6f storage]# erl
Erlang/OTP 19 [erts-8.0] [source-6dc93c1] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.0  (abort with ^G)
1>
1> application:start(inets).
ok
2> application:start(crypto).
ok
3> application:start(asn1).
ok
4> application:start(public_key).
ok
5> application:start(ssl).
ok
6> exit

then rabbitmqctl stop_app and rabbitmqctl start_app, it works, thus confirming those services aren't starting on application boot.

Adding {ssl_listeners, [5671]} to the config (even though I don't want it) also (hacky)fixes the problem.

Stono commented

If you want to replicate the issue - see the referenced repository which the setup captured in a Dockerfile.

Stono commented

Man, this is an epic mess.

td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:08 ===
td.rabbitmq_1  | Starting RabbitMQ 3.6.5 on Erlang 19.0
td.rabbitmq_1  | Copyright (C) 2007-2016 Pivotal Software, Inc.
td.rabbitmq_1  | Licensed under the MPL.  See http://www.rabbitmq.com/
td.rabbitmq_1  |
td.rabbitmq_1  |               RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
td.rabbitmq_1  |   ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
td.rabbitmq_1  |   ##  ##
td.rabbitmq_1  |   ##########  Logs: /storage/log/rabbit@8ee1683721ab.log
td.rabbitmq_1  |   ######  ##        /storage/log/rabbit@8ee1683721ab-sasl.log
td.rabbitmq_1  |   ##########
td.rabbitmq_1  |               Starting broker...
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:08 ===
td.rabbitmq_1  | node           : rabbit@8ee1683721ab
td.rabbitmq_1  | home dir       : /home/rabbitmq
td.rabbitmq_1  | config file(s) : /storage/rabbitmq.config
td.rabbitmq_1  | cookie hash    : rgMV4ejgS37e3/NLuAH/PQ==
td.rabbitmq_1  | log            : /storage/log/rabbit@8ee1683721ab.log
td.rabbitmq_1  | sasl log       : /storage/log/rabbit@8ee1683721ab-sasl.log
td.rabbitmq_1  | database dir   : /storage/mnesia/rabbit@8ee1683721ab
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:10 ===
td.rabbitmq_1  | autocluster: Using consul backend
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:10 ===
td.rabbitmq_1  | autocluster: Delaying startup for 3823ms.
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: Starting consul registration.
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: GET http://consul:8500/v1/health/service/td-rabbitmq?passing
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
td.rabbitmq_1  |                              [{"date","Wed, 07 Sep 2016 12:09:14 GMT"},
td.rabbitmq_1  |                               {"content-length","2"},
td.rabbitmq_1  |                               {"content-type","application/json"},
td.rabbitmq_1  |                               {"x-consul-index","7"},
td.rabbitmq_1  |                               {"x-consul-knownleader","true"},
td.rabbitmq_1  |                               {"x-consul-lastcontact","0"}],
td.rabbitmq_1  |                              "[]"}}]
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: Registering node with consul.
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: POST http://consul:8500/v1/agent/service/register [<<"{\"ID\":\"td-rabbitmq:8ee1683721ab\",\"Name\":\"td-rabbitmq\",\"Address\":\"8ee1683721ab\",\"Port\":5672,\"Check\":{\"Notes\":\"RabbitMQ Auto-Cluster Plugin TTL Check\",\"TTL\":\"60s\"}}">>]
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
td.rabbitmq_1  |                              [{"date","Wed, 07 Sep 2016 12:09:14 GMT"},
td.rabbitmq_1  |                               {"content-length","0"},
td.rabbitmq_1  |                               {"content-type","text/plain; charset=utf-8"}],
td.rabbitmq_1  |                              []}}]
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: Registered node with consul.
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: Discovered []
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | autocluster: Node appears to be the first in the cluster.
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Memory limit set to 2383MB of 5958MB total.
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Disk free limit set to 50MB
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Limiting to approx 924 file handles (829 sockets)
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | FHC read buffering:  OFF
td.rabbitmq_1  | FHC write buffering: ON
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Priority queues enabled, real BQ is rabbit_variable_queue
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Management plugin: using rates mode 'basic'
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | msg_store_transient: using rabbit_msg_store_ets_index to provide index
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | msg_store_persistent: using rabbit_msg_store_ets_index to provide index
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Applying definitions from: /storage/rabbitmq_definitions.json
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Changing password for 'guest'
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Setting user tags for user 'guest' to [administrator]
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | started TCP Listener on [::]:5672
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | stopped TCP Listener on [::]:5672
td.rabbitmq_1  |
td.rabbitmq_1  |
td.rabbitmq_1  | BOOT FAILED
td.rabbitmq_1  | ===========
td.rabbitmq_1  |
td.rabbitmq_1  | Error description:
td.rabbitmq_1  |    {could_not_start,rabbit,
td.rabbitmq_1  |        {{case_clause,
td.rabbitmq_1  |             {error,
td.rabbitmq_1  |                 {{shutdown,
td.rabbitmq_1  |                      {failed_to_start_child,
td.rabbitmq_1  |                          {ranch_listener_sup,
td.rabbitmq_1  |                              {acceptor,{0,0,0,0,0,0,0,0},5671}},
td.rabbitmq_1  |                          {shutdown,
td.rabbitmq_1  |                              {failed_to_start_child,ranch_acceptors_sup,
td.rabbitmq_1  |                                  {{badmatch,false},
td.rabbitmq_1  |                                   [{ranch_ssl,listen,1,
td.rabbitmq_1  |                                        [{file,"src/ranch_ssl.erl"},{line,88}]},
td.rabbitmq_1  |                                    {ranch_acceptors_sup,init,1,
td.rabbitmq_1  |                                        [{file,"src/ranch_acceptors_sup.erl"},
td.rabbitmq_1  |                                         {line,35}]},
td.rabbitmq_1  |                                    {supervisor,init,1,
td.rabbitmq_1  |                                        [{file,"supervisor.erl"},{line,294}]},
td.rabbitmq_1  |                                    {gen_server,init_it,6,
td.rabbitmq_1  |                                        [{file,"gen_server.erl"},{line,328}]},
td.rabbitmq_1  |                                    {proc_lib,init_p_do_apply,3,
td.rabbitmq_1  |                                        [{file,"proc_lib.erl"},
td.rabbitmq_1  |                                         {line,247}]}]}}}}},
td.rabbitmq_1  |                  {child,undefined,'rabbit_tcp_listener_sup_:::5671',
td.rabbitmq_1  |                      {tcp_listener_sup,start_link,
td.rabbitmq_1  |                          [{0,0,0,0,0,0,0,0},
td.rabbitmq_1  |                           5671,ranch_ssl,
td.rabbitmq_1  |                           [inet6,
td.rabbitmq_1  |                            {backlog,128},
td.rabbitmq_1  |                            {nodelay,true},
td.rabbitmq_1  |                            {linger,{true,0}},
td.rabbitmq_1  |                            {exit_on_close,false},
td.rabbitmq_1  |                            {versions,['tlsv1.2','tlsv1.1',tlsv1]}],
td.rabbitmq_1  |                           rabbit_connection_sup,[],
td.rabbitmq_1  |                           {rabbit_networking,tcp_listener_started,
td.rabbitmq_1  |                               ['amqp/ssl']},
td.rabbitmq_1  |                           {rabbit_networking,tcp_listener_stopped,
td.rabbitmq_1  |                               ['amqp/ssl']},
td.rabbitmq_1  |                           1,"SSL Listener"]},
td.rabbitmq_1  |                      transient,infinity,supervisor,
td.rabbitmq_1  |                      [tcp_listener_sup]}}}},
td.rabbitmq_1  |         [{rabbit_networking,start_listener0,5,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,294}]},
td.rabbitmq_1  |          {rabbit_networking,'-start_listener/5-lc$^0/1-0-',5,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,282}]},
td.rabbitmq_1  |          {rabbit_networking,start_listener,5,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,282}]},
td.rabbitmq_1  |          {rabbit_networking,'-boot_ssl/1-lc$^0/1-0-',3,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,142}]},
td.rabbitmq_1  |          {rabbit_networking,boot_ssl,1,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,142}]},
td.rabbitmq_1  |          {rabbit_networking,boot,0,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,128}]},
td.rabbitmq_1  |          {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
td.rabbitmq_1  |              [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
td.rabbitmq_1  |          {rabbit_boot_steps,run_step,2,
td.rabbitmq_1  |              [{file,"src/rabbit_boot_steps.erl"},{line,49}]}]}}
td.rabbitmq_1  |
td.rabbitmq_1  | Log files (may contain more information):
td.rabbitmq_1  |    /storage/log/rabbit@8ee1683721ab.log
td.rabbitmq_1  |    /storage/log/rabbit@8ee1683721ab-sasl.log
td.rabbitmq_1  |
td.rabbitmq_1  |
td.rabbitmq_1  | =INFO REPORT==== 7-Sep-2016::12:09:14 ===
td.rabbitmq_1  | Error description:
td.rabbitmq_1  |    {could_not_start,rabbit,
td.rabbitmq_1  |        {{case_clause,
td.rabbitmq_1  |             {error,
td.rabbitmq_1  |                 {{shutdown,
td.rabbitmq_1  |                      {failed_to_start_child,
td.rabbitmq_1  |                          {ranch_listener_sup,
td.rabbitmq_1  |                              {acceptor,{0,0,0,0,0,0,0,0},5671}},
td.rabbitmq_1  |                          {shutdown,
td.rabbitmq_1  |                              {failed_to_start_child,ranch_acceptors_sup,
td.rabbitmq_1  |                                  {{badmatch,false},
td.rabbitmq_1  |                                   [{ranch_ssl,listen,1,
td.rabbitmq_1  |                                        [{file,"src/ranch_ssl.erl"},{line,88}]},
td.rabbitmq_1  |                                    {ranch_acceptors_sup,init,1,
td.rabbitmq_1  |                                        [{file,"src/ranch_acceptors_sup.erl"},
td.rabbitmq_1  |                                         {line,35}]},
td.rabbitmq_1  |                                    {supervisor,init,1,
td.rabbitmq_1  |                                        [{file,"supervisor.erl"},{line,294}]},
td.rabbitmq_1  |                                    {gen_server,init_it,6,
td.rabbitmq_1  |                                        [{file,"gen_server.erl"},{line,328}]},
td.rabbitmq_1  |                                    {proc_lib,init_p_do_apply,3,
td.rabbitmq_1  |                                        [{file,"proc_lib.erl"},
td.rabbitmq_1  |                                         {line,247}]}]}}}}},
td.rabbitmq_1  |                  {child,undefined,'rabbit_tcp_listener_sup_:::5671',
td.rabbitmq_1  |                      {tcp_listener_sup,start_link,
td.rabbitmq_1  |                          [{0,0,0,0,0,0,0,0},
td.rabbitmq_1  |                           5671,ranch_ssl,
td.rabbitmq_1  |                           [inet6,
td.rabbitmq_1  |                            {backlog,128},
td.rabbitmq_1  |                            {nodelay,true},
td.rabbitmq_1  |                            {linger,{true,0}},
td.rabbitmq_1  |                            {exit_on_close,false},
td.rabbitmq_1  |                            {versions,['tlsv1.2','tlsv1.1',tlsv1]}],
td.rabbitmq_1  |                           rabbit_connection_sup,[],
td.rabbitmq_1  |                           {rabbit_networking,tcp_listener_started,
td.rabbitmq_1  |                               ['amqp/ssl']},
td.rabbitmq_1  |                           {rabbit_networking,tcp_listener_stopped,
td.rabbitmq_1  |                               ['amqp/ssl']},
td.rabbitmq_1  |                           1,"SSL Listener"]},
td.rabbitmq_1  |                      transient,infinity,supervisor,
td.rabbitmq_1  |                      [tcp_listener_sup]}}}},
td.rabbitmq_1  |         [{rabbit_networking,start_listener0,5,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,294}]},
td.rabbitmq_1  |          {rabbit_networking,'-start_listener/5-lc$^0/1-0-',5,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,282}]},
td.rabbitmq_1  |          {rabbit_networking,start_listener,5,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,282}]},
td.rabbitmq_1  |          {rabbit_networking,'-boot_ssl/1-lc$^0/1-0-',3,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,142}]},
td.rabbitmq_1  |          {rabbit_networking,boot_ssl,1,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,142}]},
td.rabbitmq_1  |          {rabbit_networking,boot,0,
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,128}]},
td.rabbitmq_1  |          {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
td.rabbitmq_1  |              [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
td.rabbitmq_1  |          {rabbit_boot_steps,run_step,2,
td.rabbitmq_1  |              [{file,"src/rabbit_boot_steps.erl"},{line,49}]}]}}
td.rabbitmq_1  |
td.rabbitmq_1  | Log files (may contain more information):
td.rabbitmq_1  |    /storage/log/rabbit@8ee1683721ab.log
td.rabbitmq_1  |    /storage/log/rabbit@8ee1683721ab-sasl.log
td.rabbitmq_1  |
td.rabbitmq_1  | {"init terminating in do_boot",{could_not_start,rabbit,{{case_clause,{error,{{shutdown,{failed_to_start_child,{ranch_listener_sup,{acceptor,{0,0,0,0,0,0,0,0},5671}},{shutdown,{failed_to_start_child,ranch_acceptors_sup,{{badmatch,false},[{ranch_ssl,listen,1,[{file,"src/ranch_ssl.erl"},{line,88}]},{ranch_acceptors_sup,init,1,[{file,"src/ranch_acceptors_sup.erl"},{line,35}]},{supervisor,init,1,[{file,"supervisor.erl"},{line,294}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,328}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}}}}},{child,undefined,'rabbit_tcp_listener_sup_:::5671',{tcp_listener_sup,start_link,[{0,0,0,0,0,0,0,0},5671,ranch_ssl,[inet6,{backlog,128},{nodelay,true},{linger,{true,0}},{exit_on_close,false},{versions,['tlsv1.2','tlsv1.1',tlsv1]}],rabbit_connection_sup,[],{rabbit_networking,tcp_listener_started,['amqp/ssl']},{rabbit_networking,tcp_listener_stopped,['amqp/ssl']},1,"SSL Listener"]},transient,infinity,supervisor,[tcp_listener_sup]}}}},[{rabbit_networking,start_listener0,5,[{file,"src/rabbit_networking.erl"},{line,294}]},{rabbit_networking,'-start_listener/5-lc$^0/1-0-',5,[{file,"src/rabbit_networking.erl"},{line,282}]},{rabbit_networking,start_listener,5,[{file,"src/rabbit_networking.erl"},{line,282}]},{rabbit_networking,'-boot_ssl/1-lc$^0/1-0-',3,[{file,"src/rabbit_networking.erl"},{line,142}]},{rabbit_networking,boot_ssl,1,[{file,"src/rabbit_networking.erl"},{line,142}]},{rabbit_networking,boot,0,[{file,"src/rabbit_networking.erl"},{line,128}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,"src/rabbit_boot_steps.erl"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,"src/rabbit_boot_steps.erl"},{line,49}]}]}}}
td.rabbitmq_1  |
td.rabbitmq_1  | Crash dump is being written to: erl_crash.dump...done
td.rabbitmq_1  | init terminating in do_boot ()
td.rabbitmq_1  | Preboot complete, starting rabbitmq...
td.rabbitmq_1  | Hostname: 8ee1683721ab, RabbitMQ Nodename: rabbit@8ee1683721ab
td.rabbitmq_1  | [INFO] RabbitMQ Started (PID: 17, ServerPid: 17)!
td.rabbitmq_1  |              [{file,"src/rabbit_networking.erl"},{line,128}]},
td.rabbitmq_1  |          {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
td.rabbitmq_1  |              [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
td.rabbitmq_1  |          {rabbit_boot_steps,run_step,2,
td.rabbitmq_1  |              [{file,"src/rabbit_boot_steps.erl"},{line,49}]}]}}
td.rabbitmq_1  |
td.rabbitmq_1  | Log files (may contain more information):
td.rabbitmq_1  |    /storage/log/rabbit@8ee1683721ab.log
td.rabbitmq_1  |    /storage/log/rabbit@8ee1683721ab-sasl.log
Stono commented

@xek if I do a 'stop_app' 'start_app' on the server, it all works fine the second time... regardless of having the 5671 listener or not... so seems like the initial boot, the ssl app isn't being started!

xek commented

I think the line that it fails on is

true = lists:keymember(cert, 1, Opts)
                orelse lists:keymember(certfile, 1, Opts),

So it looks like it's failing because you don't have cert or certfile config defined

Stono commented

@xek Yeah i figured that, added the certs, server boots, but now i'm back where I am before, and it requires me to manually stop and start the rabbitmq app for it to work (initial boot fails)

xek commented

So it still fails on the missing certificate config the first time it boots, and the second time it works?

Stono commented

Nooo, certificate issue is fixed.
See this log.

$ ki logs --follow rabbitmq-1483846031-jutq2
Using default rabbitmq configuration as no config found in /storage/rabbitmq.config
Using default rabbitmq definitions as no config found in /storage/rabbitmq_definitions.json
Generating certificates for TLS
NOTE: This is totally a hack, to get 5671 listening
which in turn fixes https://github.com/aweber/rabbitmq-autocluster/issues/92
Doing CA
Generating a 2048 bit RSA private key
............................................+++
......+++
writing new private key to './private/cakey.pem'
-----
Doing Server
Generating RSA private key, 2048 bit long modulus
...........................+++
.....................+++
e is 65537 (0x10001)
Using configuration from openssl.cnf
Check that the request matches the signature
Signature ok
The Subject's Distinguished Name is as follows
commonName            :ASN.1 12:'rabbitmq-1483846031-jutq2'
organizationName      :ASN.1 12:'server'
Certificate is to be certified until Sep  7 13:11:30 2017 GMT (365 days)

Write out database with 1 new entries
Data Base Updated
Preboot complete, starting rabbitmq...
Using ip address as hostname
Hostname: 10.20.1.14, RabbitMQ Nodename: rabbit@10.20.1.14
[INFO] RabbitMQ Started (PID: 32, ServerPid: 32)!

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /storage/log/rabbit@10.20.1.14.log
  ######  ##        /storage/log/rabbit@10.20.1.14-sasl.log
  ##########
              Starting broker...

=INFO REPORT==== 7-Sep-2016::13:11:33 ===
Starting RabbitMQ 3.6.5 on Erlang 19.0
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 7-Sep-2016::13:11:33 ===
node           : rabbit@10.20.1.14
home dir       : /home/rabbitmq
config file(s) : /storage/rabbitmq.config
cookie hash    : sjAQ44vvJUb2ijPGD0IntA==
log            : /storage/log/rabbit@10.20.1.14.log
sasl log       : /storage/log/rabbit@10.20.1.14-sasl.log
database dir   : /storage/mnesia/rabbit@10.20.1.14

=INFO REPORT==== 7-Sep-2016::13:11:40 ===
autocluster: Using k8s backend

=INFO REPORT==== 7-Sep-2016::13:11:40 ===
autocluster: Delaying startup for 8726ms.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
autocluster: Starting k8s registration.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/td-integration/endpoints/rabbitmq

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
autocluster: Response: [{error,
                            {failed_connect,
                                [{to_address,
                                     {"kubernetes.default.svc.cluster.local",
                                      443}},
                                 {inet,[inet],ssl_not_started}]}}]

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
autocluster: HTTP Error {failed_connect,
                            [{to_address,
                                 {"kubernetes.default.svc.cluster.local",443}},
                             {inet,[inet],ssl_not_started}]}

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
ERROR: "autocluster: Failed to get nodes from k8s - ~s~n" - [{failed_connect,
                                                              [{to_address,
                                                                {"kubernetes.default.svc.cluster.local",
                                                                 443}},
                                                               {inet,
                                                                [inet],
                                                                ssl_not_started}]}]

=ERROR REPORT==== 7-Sep-2016::13:11:49 ===
autocluster: Could not fetch node list from autocluster_k8s: {failed_connect,
                                                              [{to_address,
                                                                {"kubernetes.default.svc.cluster.local",
                                                                 443}},
                                                               {inet,
                                                                [inet],
                                                                ssl_not_started}]}.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Memory limit set to 2999MB of 7499MB total.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Disk free limit set to 50MB

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Database directory at /storage/mnesia/rabbit@10.20.1.14 is empty. Initialising from scratch...

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 7-Sep-2016::13:11:49 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Applying definitions from: /storage/rabbitmq_definitions.json

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Creating user 'guest'

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Changing password for 'guest'

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Setting user tags for user 'guest' to [administrator]

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Adding vhost '/'

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Setting permissions for 'guest' in '/' to '.*', '.*', '.*'

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
started SSL Listener on [::]:5671

=ERROR REPORT==== 7-Sep-2016::13:11:49 ===
Failed to retrieve AWS credentials: undefined

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
autocluster: (cleanup) Timer started {60,false}

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Management plugin started. Port: 15672

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics event collector started.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics channel stats collector started.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics queue stats collector started.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics database started.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_queue_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_queue_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_queue_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_queue_stats_queue_msg_rates with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_queue_stats_process_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_queue_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_vhost_stats_deliver_get with interval 5000.
 completed with 12 plugins.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Server startup complete; 12 plugins started.
 * rabbitmq_management_visualiser
 * rabbitmq_shovel_management
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * webmachine
 * mochiweb
 * autocluster
 * rabbitmq_aws
 * rabbitmq_shovel
 * rabbitmq_consistent_hash_exchange
 * rabbitmq_management_agent
 * amqp_client

then i do rabbitmqctl stop_app and rabbitmqctl start_app

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_vhost_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_vhost_stats_queue_msg_rates with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_vhost_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_vhost_stats_coarse_conn_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_queue_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_queue_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_queue_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_stats_queue_msg_counts with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_stats_process_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_exchange_stats_deliver_get with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_channel_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_exchange_stats_fine_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_node_stats_coarse_node_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_node_node_stats_coarse_node_node_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_connection_stats_coarse_conn_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table aggr_connection_stats_process_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table channel_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:11:49 ===
Statistics garbage collector started for table connection_stats with interval 5000.

=INFO REPORT==== 7-Sep-2016::13:12:49 ===
autocluster: (cleanup) checking cluster

=INFO REPORT==== 7-Sep-2016::13:12:49 ===
autocluster: (cleanup) Checking for partitioned nodes.

=INFO REPORT==== 7-Sep-2016::13:12:49 ===
autocluster: (cleanup) No partitioned nodes found.

=INFO REPORT==== 7-Sep-2016::13:13:32 ===
Stopping RabbitMQ

=INFO REPORT==== 7-Sep-2016::13:13:32 ===
stopped SSL Listener on [::]:5671

=INFO REPORT==== 7-Sep-2016::13:13:32 ===
stopped TCP Listener on [::]:5672

=INFO REPORT==== 7-Sep-2016::13:13:32 ===
Stopped RabbitMQ application

=INFO REPORT==== 7-Sep-2016::13:13:39 ===
Starting RabbitMQ 3.6.5 on Erlang 19.0
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /storage/log/rabbit@10.20.1.14.log
  ######  ##        /storage/log/rabbit@10.20.1.14-sasl.log
  ##########
              Starting broker...

=INFO REPORT==== 7-Sep-2016::13:13:39 ===
node           : rabbit@10.20.1.14
home dir       : /home/rabbitmq
config file(s) : /storage/rabbitmq.config
cookie hash    : sjAQ44vvJUb2ijPGD0IntA==
log            : /storage/log/rabbit@10.20.1.14.log
sasl log       : /storage/log/rabbit@10.20.1.14-sasl.log
database dir   : /storage/mnesia/rabbit@10.20.1.14

=INFO REPORT==== 7-Sep-2016::13:13:39 ===
autocluster: Using k8s backend

=INFO REPORT==== 7-Sep-2016::13:13:39 ===
autocluster: Delaying startup for 10864ms.

=INFO REPORT==== 7-Sep-2016::13:13:50 ===
autocluster: Starting k8s registration.

=INFO REPORT==== 7-Sep-2016::13:13:50 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/td-integration/endpoints/rabbitmq

=INFO REPORT==== 7-Sep-2016::13:13:50 ===
autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
                             [{"date","Wed, 07 Sep 2016 13:13:50 GMT"},
                              {"content-length","978"},
                              {"content-type","application/json"}],
                             "{\"kind\":\"Endpoints\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"rabbitmq\",\"namespace\":\"td-integration\",\"selfLink\":\"/api/v1/namespaces/td-integration/endpoints/rabbitmq\",\"uid\":\"9382d130-74f3-11e6-ba1b-42010af000a6\",\"resourceVersion\":\"894811\",\"creationTimestamp\":\"2016-09-07T12:06:58Z\",\"labels\":{\"app\":\"rabbitmq\",\"tier\":\"messaging\"}},\"subsets\":[{\"addresses\":[{\"ip\":\"10.20.1.14\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"td-integration\",\"name\":\"rabbitmq-1483846031-jutq2\",\"uid\":\"9652769a-74fc-11e6-ba1b-42010af000a6\",\"resourceVersion\":\"894789\"}},{\"ip\":\"10.20.1.17\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"td-integration\",\"name\":\"rabbitmq-1483846031-y91i6\",\"uid\":\"9663c6d4-74fc-11e6-ba1b-42010af000a6\",\"resourceVersion\":\"894791\"}}],\"ports\":[{\"name\":\"amqp\",\"port\":5672,\"protocol\":\"TCP\"},{\"name\":\"amqp-backend\",\"port\":25672,\"protocol\":\"TCP\"},{\"name\":\"amqp-old\",\"port\":5671,\"protocol\":\"TCP\"},{\"name\":\"erlang\",\"port\":4369,\"protocol\":\"TCP\"},{\"name\":\"management\",\"port\":15672,\"protocol\":\"TCP\"}]}]}\n"}}]

=INFO REPORT==== 7-Sep-2016::13:13:50 ===
autocluster: Discovered ['rabbit@10.20.1.14','rabbit@10.20.1.17']

=INFO REPORT==== 7-Sep-2016::13:13:50 ===
autocluster: Joining the cluster.

=INFO REPORT==== 7-Sep-2016::13:13:50 ===
Resetting Rabbit

=INFO REPORT==== 7-Sep-2016::13:13:50 ===
Clustering with ['rabbit@10.20.1.17'] as disc node

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
autocluster: Cluster joined.

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Memory limit set to 2999MB of 7499MB total.

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Disk free limit set to 50MB

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 7-Sep-2016::13:13:51 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Applying definitions from: /storage/rabbitmq_definitions.json

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Changing password for 'guest'

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Setting user tags for user 'guest' to [administrator]

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Setting permissions for 'guest' in '/' to '.*', '.*', '.*'

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
started SSL Listener on [::]:5671

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
rabbit on node 'rabbit@10.20.1.17' up

=ERROR REPORT==== 7-Sep-2016::13:13:51 ===
Failed to retrieve AWS credentials: undefined

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
autocluster: (cleanup) Timer started {60,false}

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Management plugin started. Port: 15672
 completed with 12 plugins.

=INFO REPORT==== 7-Sep-2016::13:13:51 ===
Server startup complete; 12 plugins started.
 * rabbitmq_management_visualiser
 * rabbitmq_shovel_management
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * webmachine
 * mochiweb
 * autocluster
 * rabbitmq_aws
 * rabbitmq_shovel
 * rabbitmq_consistent_hash_exchange
 * rabbitmq_management_agent
 * amqp_client
Stono commented

So it seems like, on the first boot, when the plugin starts, the ssl vm isn't active. Then, on the second app start, something else has already kicked it into play, so it works.

xek commented

So, just a suggestion, because it looks like maybe some sort of race condition, maybe try lengthening the time in autocluster {startup_delay, 5}?

Stono commented

@xek tried that, up to 60 seconds. so it's not race, ssl isn't being started by autocluster successfully, but something else is starting it (probably the 5671 listener).

Stono commented

@xek I think I have something, check this out.
https://github.com/peopledata/td.rabbitmq/commit/8bc92e8856825cf88ceb0ead91c65940fd1e6fcd

That commit fixes the problem?! It seems by using a https://github.com/peopledata/td.rabbitmq/blob/master/rabbitmq_definitions.json we introduce some sort of race condition?

Definitions are loaded by one of the plugins (rabbitmq_management) and that is not coordinated in any way with this plugin, so that could be a race condition but I see no real evidence of that.

Stono commented

All I can say is that when using load_definitions autocluster breaks, with ssl_not_started error.
If I manually restart the rabbitmq app once all the plugins are loaded, the second time round, autocluster works.

Feels racy to me, and introduced when loading definitions from a file, rather than defaults.

Definitions can include things such as dynamic Shovels and lead to natural race conditions such as rabbitmq/rabbitmq-shovel#13. I don't know what may be going on that affects this particular plugin/backend, so this is just one known example.

Stono commented

So, I couldn't figure out a way to get this working at all, so I've implemented the dirties hack ever. See https://github.com/peopledata/td.rabbitmq/commit/0ff9c9c37a6288dc59a3ae5477a4da85b00fb51c

Basically I:

  1. start rabbitmq
  2. tail the log until 'Server startup complete' appears (ie all plugins are loaded)
  3. stop rabbitmq
  4. enable the autocluster plugin
  5. start rabbitmq

Disgusting, but the only way I could get it working

@Stono are you aware of rabbitmqctl wait <pid path>?

Stono commented

@michaelklishin nope, I've just switched to that, much nicer than tailing the file and is fired at the same point (once all plugins are complete and server is up). Thanks!

somwhat related: deadtrickster/prometheus_rabbitmq_exporter#12
(autocluster and prometheus_rabbitmq_exporter on kubernetes make rabbitmq segfault)

@janwillies RabbitMQ has no native code and neither do those two plugins. Switch to a more recent Erlang release and make sure it's 64 bit.

It is definitely related to plugin startup order, for example, in my case it works (without stop and start) if plugin startup order is following:

=INFO REPORT==== 3-Jan-2017::13:07:12 ===
Server startup complete; 8 plugins started.
 * rabbitmq_delayed_message_exchange
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * webmachine
 * rabbitmq_management_agent
 * autocluster
 * mochiweb
 * amqp_client

and following order is not working:

=INFO REPORT==== 3-Jan-2017::13:13:29 ===
Server startup complete; 8 plugins started.
 * rabbitmq_management
 * rabbitmq_delayed_message_exchange
 * rabbitmq_web_dispatch
 * webmachine
 * autocluster
 * mochiweb
 * rabbitmq_management_agent
 * amqp_client

It seems that rabbitmq_management_agent is starting ssl.
I do not have ssl certs configured.

I tried @xek's recommendation: application:ensure_all_started(ssl).
so far it is working fine. @Stono could you try following patch:

diff --git a/src/autocluster_k8s.erl b/src/autocluster_k8s.erl
index e68b050..7d0823b 100644
--- a/src/autocluster_k8s.erl
+++ b/src/autocluster_k8s.erl
@@ -25,6 +25,7 @@
 %% @end
 %%
 nodelist() ->
+    {ok, _} = application:ensure_all_started(ssl),
     case make_request() of
 	{ok, Response} ->
 	    Addresses = extract_node_list(Response),
Stono commented

@tarvip sorry for the delay, your patch worked nicely thank you

Stono commented

@tarvip actually I lied, it didn't. :|

Thanks @tarvip for your patch, it works great on my custom build of this plugin and I'm looking forward to seeing it included in the next official release. You should submit a PR with your fix!

gmr commented

This plugin was forked by the RabbitMQ team and is now part of RabbitMQ. More information can be found @ https://github.com/rabbitmq/rabbitmq-autocluster