xing/beetle

How does the RCC server cope with the failure of a rabbit server (cluster)?

Closed this issue · 6 comments

  • Software:
    • rabbitmq-server: 2.6.1-1
    • redis-server: 2:2.4.2-1~bpo60+1
    • beetle: 0.3.0.rc.11

I have three servers with ip-addresses: 192.168.111.170 (www1), 192.168.111.171 (www2), 192.168.111.172 (www3).

On the server WWW1:
service redis-server start
service rabbitmq-server start
rabbitmqctl cluster_status
Cluster status of node rabbit@www1 ...
[{nodes,[{disc,[rabbit@www1,rabbit@www3,rabbit@www2]}]},
{running_nodes,[rabbit@www3,rabbit@www2,rabbit@www1]}]
...done.

beetle configuration_server start -- --verbose --redis-servers 192.168.111.170:6379,192.168.111.171:6379,192.168.111.172:6379 --redis-master-file /root/beetle-redis-master-file --amqp-servers 192.168.111.170:5672,192.168.111.171:5672,192.168.111.172:5672 --pid-dir /var/log/beetle

beetle configuration_client start -- --verbose --client-id rcc-1 --redis-master-file /root/beetle-redis-master-file --amqp-servers 192.168.111.170:5672,192.168.111.171:5672,192.168.111.172:5672 --pid-dir /var/log/beetle

Log:
I, [2012-02-01 14:59:31#10500]  INFO -- : Beetle: connecting to rabbit 192.168.111.170:5672
I, [2012-02-01 14:59:31#10500]  INFO -- : Beetle: connecting to rabbit 192.168.111.171:5672
I, [2012-02-01 14:59:31#10500]  INFO -- : Beetle: connecting to rabbit 192.168.111.172:5672

On the server WWW2:
service redis-server start
service rabbitmq-server start

beetle configuration_client start -- --verbose --client-id rcc-2 --redis-master-file /root/beetle-redis-master-file --amqp-servers 192.168.111.170:5672,192.168.111.171:5672,192.168.111.172:5672 --pid-dir /var/log/beetle

On the server WWW3:
service redis-server start
service rabbitmq-server start

beetle configuration_client start -- --verbose --client-id rcc-3 --redis-master-file /root/beetle-redis-master-file --amqp-servers 192.168.111.170:5672,192.168.111.171:5672,192.168.111.172:5672 --pid-dir /var/log/beetle

Everything works, including all of the RCC-processes. They receive messages from the RCS-process.

Strange scenario:

WWW1: service rabbitmq-server stop
RCC1 Log:
W, [2012-02-01 15:04:28#10500] WARN -- : Beetle: lost connection: 192.168.111.170:5672. reconnecting.
W, [2012-02-01 15:04:28#10500] WARN -- : Beetle: lost connection: 192.168.111.170:5672. reconnecting.

RCS Log:
D, [2012-02-01 15:05:41#10483] DEBUG -- : Beetle: sending reconfigure
D, [2012-02-01 15:05:41#10483] DEBUG -- : Beetle: trying to send message reconfigure:ae545770-4cc4-11e1-a7bb-97834f7db210 to 192.168.111.170:5672
W, [2012-02-01 15:05:41#10483] WARN -- : Beetle: error closing down bunny Connection refused - connect(2)
D, [2012-02-01 15:05:41#10483] DEBUG -- : Beetle: trying to send message reconfigure:ae545770-4cc4-11e1-a7bb-97834f7db210 to 192.168.111.170:5672
W, [2012-02-01 15:05:41#10483] WARN -- : Beetle: error closing down bunny Connection refused - connect(2)
I, [2012-02-01 15:05:41#10483] INFO -- : Beetle: server 192.168.111.170:5672 down: Connection refused - connect(2)
D, [2012-02-01 15:05:41#10483] DEBUG -- : Beetle: trying to send message reconfigure:ae545770-4cc4-11e1-a7bb-97834f7db210 to 192.168.111.171:5672
D, [2012-02-01 15:05:41#10483] DEBUG -- : Beetle: message sent!
....
I, [2012-02-01 15:06:52#10483] INFO -- : Publishing reconfigure message with server '192.168.111.171:6379'
D, [2012-02-01 15:06:52#10483] DEBUG -- : Beetle: sending reconfigure
D, [2012-02-01 15:06:52#10483] DEBUG -- : Beetle: trying to send message reconfigure:d82c46b6-4cc4-11e1-a58d-1b3141ff5712 to 192.168.111.172:5672
D, [2012-02-01 15:06:52#10483] DEBUG -- : Beetle: message sent!

RCC2:
W, [2012-02-01 15:07:39#9306] WARN -- : Beetle: lost connection: 192.168.111.170:5672. reconnecting.
W, [2012-02-01 15:07:49#9306] WARN -- : Beetle: lost connection: 192.168.111.170:5672. reconnecting.

RCC3:
W, [2012-02-01 15:08:19#5026] WARN -- : Beetle: lost connection: 192.168.111.170:5672. reconnecting.
D, [2012-02-01 15:08:22#5026] DEBUG -- : Beetle: processing message msgid:system_reconfigure_rcc-3:0df9069e-4cc5-11e1-b599-c3fcbcee9bd7
D, [2012-02-01 15:08:22#5026] DEBUG -- : Beetle: ack! for message msgid:system_reconfigure_rcc-3:0df9069e-4cc5-11e1-b599-c3fcbcee9bd7
I, [2012-02-01 15:08:22#5026] INFO -- : Received reconfigure message with server '192.168.111.171:6379' and token '1328093960543'
D, [2012-02-01 15:08:22#5026] DEBUG -- : Beetle: message processing completed

So, what am I doing wrong?

The documentation says:
beetle configuration_client start -- --help
--amqp-servers LIST AMQP server list (e.g. 192.168.0.1:5672,192.168.0.2:5672)

Do I understand correctly that if I stop the first AMQP server, the RCC / RCS should start receiving messages from the others?

Hi,

I noticed a few things in your setup that might cause problems.

First, as far as I understand, you're trying to use beetle together with a rabbitmq cluster. We have never tried to do that. We just set up single-node rabbitmq servers, which are completely independent.

That said, in theory, the whole system could work nonetheless. But we never tested such a configuration.

Second, I've noticed you're running the RCS and the RCCs on your rabbit machines. If your rabbit servers are also the machines where you run your message processors, then that's fine. If you have additional worker machines, then you also need to run an RCC instance on each of them. In general, running the RCCs on the rabbit servers is problematic, because if one machine has a hard failure, three components fail at the same time: a rabbit, a redis, and an RCC process. We have designed the system to tolerate single node failures only. So it's strongly recommended that you separate the failover system and the redis server from your rabbits.

Third, you're using three redis instances, which also run on the rabbit servers. You only need two redis servers. Again, it could also work with three redis servers (the code supports it), but we never tested it.

Back to your problem:

The reconfigure message (like all other messages of the failover protocol) is not sent redundantly, as this would mean it couldn't be processed anymore when the redis master server dies, causing the whole system to deadlock. What happens is that these messages are sent to all three rabbits in a round robin fashion, taking dead servers into account. You will see error messages in the logs for failed message sending attempts, but this does not mean it isn't working. Rabbit servers which are not reachable will be removed from the round robin list for 10 seconds and then added again.
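
The round robin behaviour described above can be modelled with a small Ruby sketch. This is a toy model and not beetle's actual publisher: the class name RoundRobinServers, its methods, and the explicit `now` argument are hypothetical, and only the 10 second retry interval is taken from the description above.

```ruby
# Toy model of round-robin publishing with temporary removal of dead servers.
# All names here are hypothetical; this is not beetle's implementation.
class RoundRobinServers
  def initialize(servers, retry_after: 10)
    @servers = servers.dup
    @retry_after = retry_after # seconds a failed server stays off the list
    @dead = {}                 # server => time at which it was marked dead
    @index = 0
  end

  # Servers eligible for publishing at time `now` (in seconds).
  # Servers whose quarantine has expired are put back on the list.
  def alive(now)
    @dead.delete_if { |_, since| now - since >= @retry_after }
    @servers.reject { |server| @dead.key?(server) }
  end

  def mark_dead(server, now)
    @dead[server] = now
  end

  # Try each eligible server once, in round-robin order. The block performs
  # the actual send and returns true on success. Returns the server that
  # accepted the message, or nil if every attempt failed.
  def publish(now)
    candidates = alive(now)
    candidates.size.times do
      server = candidates[@index % candidates.size]
      @index += 1
      return server if yield(server)
      mark_dead(server, now)
    end
    nil
  end
end
```

With three servers of which the first is down, the first publish attempt fails on that server, marks it dead, and succeeds on the next one. The failed attempt is what shows up as an error in the log, even though the message was delivered.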

So nothing in the logs tells me that there is something wrong, except that the RCC2 log has only "reconnecting" entries. But I assume you shortened it, so it's hard to tell if this is a problem.

RC11 comes with an HTTP server built into the RCS process, which you can use to check whether the setup is correct. It runs on port 8080 on the RCS machine and has a REST API. Can you put your system into the initial state, where everything is running, and paste the output of 'curl http://www1:8080/.txt' here?

HTH,

-- stefan

curl http://192.168.111.170:8080/.txt

beetle_version: 0.3.0.rc.11
configured_brokers: 192.168.111.170:5672, 192.168.111.171:5672, 192.168.111.172:5672
configured_client_ids:
configured_redis_servers: 192.168.111.170:6379, 192.168.111.171:6379, 192.168.111.172:6379
redis_master: 192.168.111.170:6379
redis_master_available?: true
redis_slaves_available: 192.168.111.171:6379, 192.168.111.172:6379
switch_in_progress: false
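
A status dump like the one above can also be checked by a script. The following is a minimal sketch that assumes the text keeps the "key: value" layout shown here; the helper names parse_status, list_value and status_warnings are hypothetical and not part of beetle.

```ruby
# Parse the "key: value" lines of the RCS status text into a Hash.
# Only the first colon separates key and value, so "host:port" values survive.
def parse_status(text)
  text.each_line.with_object({}) do |line, status|
    key, value = line.chomp.split(":", 2)
    next if value.nil? || key.strip.empty? # skip blank or malformed lines
    status[key.strip] = value.strip
  end
end

# Split a comma-separated value into a list; an empty value gives [].
def list_value(status, key)
  status.fetch(key, "").split(",").map(&:strip).reject(&:empty?)
end

# Collect human-readable warnings for a parsed status.
def status_warnings(status)
  warnings = []
  warnings << "no client ids configured" if list_value(status, "configured_client_ids").empty?
  warnings << "redis master not available" unless status["redis_master_available?"] == "true"
  warnings << "master switch in progress" if status["switch_in_progress"] == "true"
  warnings
end

# A shortened copy of the output pasted in this thread.
SAMPLE = <<~TEXT
  beetle_version: 0.3.0.rc.11
  configured_client_ids:
  redis_master: 192.168.111.170:6379
  redis_master_available?: true
  switch_in_progress: false
TEXT

puts status_warnings(parse_status(SAMPLE))
```

In a live setup the text could be fetched with Net::HTTP.get from Ruby's standard library instead of being pasted in.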

In production we will use beetle the way you described. At the moment I am simply experimenting. I would like beetle's failover to follow the same logic as Rabbit, Riak... and to tolerate consecutive hardware failures. A single standby server is very good, but adding a second standby server practically eliminates downtime.

Beetle is a good solution, thank you very much.

I see that you don't have any client_ids configured on the RCS. Currently the ids need to be statically configured in the RCS config file, or passed as an argument to the RCS.

beetle configuration_server start -- --verbose --redis-servers 192.168.111.170:6379,192.168.111.171:6379,192.168.111.172:6379 --redis-master-file /root/beetle-redis-master-file --amqp-servers 192.168.111.170:5672,192.168.111.171:5672,192.168.111.172:5672 --pid-dir /var/log/beetle --client-ids rcc-1,rcc-2,rcc-3

We decided on a statically defined list of clients, because auto-detecting the clients would have introduced another set of failure possibilities.

If I use the client ids, will the RCS then no longer switch the master automatically when a server fails?

https://github.com/xing/beetle/blob/master/lib/beetle/redis_configuration_server.rb#L128
if @client_ids.empty?
  switch_master
else
  start_invalidation
end

For example, I rebooted the first server and waited 20 minutes, but nothing happened.

If there are no client ids configured, so that @client_ids.empty? == true, then the master will be switched right away, without going through the client invalidation phase first. This is because when there are no clients, nobody has to "agree" to the master switch and we can just do it.

When you have no client ids (RCCs) configured, you will hence see no messages being received on them. But you should still see the master switch related log entries on the RCS.
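
The difference between the two branches can be illustrated with a toy model. This sketch is not beetle's code: the class ToyConfigurationServer and its methods are hypothetical, and it assumes that the master is switched once every configured client has confirmed its invalidation.

```ruby
# Toy model of the RCS decision quoted above. Not beetle's implementation.
class ToyConfigurationServer
  attr_reader :log

  def initialize(client_ids)
    @client_ids = client_ids
    @pending = [] # clients that still have to agree to the switch
    @log = []     # records which steps were taken, in order
  end

  # Called when the redis master is considered unavailable.
  def master_unavailable!
    if @client_ids.empty?
      switch_master      # nobody has to agree, switch right away
    else
      start_invalidation # ask every configured client first
    end
  end

  # Called when a client confirms that it has invalidated its master.
  # Assumption: the switch happens once all configured clients confirmed.
  def client_invalidated(client_id)
    return if @pending.empty?
    @pending.delete(client_id)
    switch_master if @pending.empty?
  end

  private

  def switch_master
    @log << :switch_master
  end

  def start_invalidation
    @pending = @client_ids.dup
    @log << :start_invalidation
  end
end
```

With an empty client list the log contains only the switch. With rcc-1 and rcc-2 configured, the switch is recorded only after both have confirmed.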

I think we can close this ticket.