basho-labs/riak-chef-cookbook

Chef puts the node in a bad state if riak cookie name/address change

Closed this issue · 2 comments

If node['riak']['args']['-name'] changes between runs, Chef puts the node in a bad state. To reproduce, replace #{node['ipaddress']} with 127.0.0.1 in attributes/default.rb and rerun Chef:

[Fri, 24 Aug 2012 14:18:17 -0400] ERROR: service[riak] (riak::default line 155) has had an error
[Fri, 24 Aug 2012 14:18:17 -0400] ERROR: Running exception handlers
[Fri, 24 Aug 2012 14:18:17 -0400] ERROR: Exception handlers complete
[Fri, 24 Aug 2012 14:18:17 -0400] FATAL: Stacktrace dumped to /tmp/vagrant-chef-1/chef-stacktrace.out
[Fri, 24 Aug 2012 14:18:17 -0400] FATAL: Chef::Exceptions::Exec: service[riak] (riak::default line 155) had an error: /sbin/service riak restart returned 1, expected 0
Chef never successfully completed! Any errors should be visible in the
output above. Please fix your recipes so that they properly complete.

Erlang doesn't die during the restart:

ps -u riak
  PID TTY          TIME CMD
 2366 ?        00:00:00 epmd
 2588 ?        00:00:00 run_erl
 2589 pts/0    00:00:19 beam
 2715 ?        00:00:00 sh
 2716 ?        00:00:00 memsup
 2717 ?        00:00:00 cpu_sup

The riak console gives this error:

sudo riak console
Attempting to restart script through sudo -H -u riak
Exec: /usr/lib64/riak/erts-5.9.1/bin/erlexec -boot /usr/lib64/riak/releases/1.2.0/riak             -embedded -config /etc/riak/app.config             -pa /usr/lib64/riak/basho-patches             -args_file /etc/riak/vm.args -- console
Root: /usr/lib64/riak
{error_logger,{{2012,8,24},{14,20,55}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1,[{file,"inet_tcp_dist.erl"},{line,70}]},{net_kernel,start_protos,4,[{file,"net_kernel.erl"},{line,1314}]},{net_kernel,start_protos,3,[{file,"net_kernel.erl"},{line,1307}]},{net_kernel,init_node,2,[{file,"net_kernel.erl"},{line,1197}]},{net_kernel,init,1,[{file,"net_kernel.erl"},{line,357}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}]}
{error_logger,{{2012,8,24},{14,20,55}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.195>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,24},{reductions,681}],[]]}
{error_logger,{{2012,8,24},{14,20,55}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['riak@127.0.0.1',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2012,8,24},{14,20,55}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2012,8,24},{14,20,55}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}

Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})

This is definitely an issue. However, riak console isn't an appropriate way to test it. The duplicate_name error shown by riak console just indicates that Riak is still running, so that error message is expected and correct.

I don't know specifically why riak restart returned a non-zero exit code, but changing the name or cookie of a Riak node needs to be done carefully.

Name changes

The name of a Riak node is its identifier in the cluster, and all cluster operations depend on the name of a node never changing. Prior to Riak 1.2.0 there was a riak-admin reip command that could be used in a name change scenario, but it didn't do the right thing. Riak 1.2.0 introduced new cluster operations that do the right thing.

When the name changes, I believe the correct procedure is the following:

  1. Stop Riak before updating vm.args
  2. Start Riak after updating vm.args (I don't think a restart works correctly with name changes but this should be tested)
  3. Force replace the old name with the new name:
     riak-admin cluster force-replace <old_name> <new_name>
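
The three steps above could be scripted roughly as follows. This is only a sketch, not something the cookbook does: the function name rename_riak_node, the run helper, and the DRY_RUN switch are hypothetical, the vm.args path is taken from the console output earlier in this issue, and sed -i assumes GNU sed. The plan and commit calls are added on the assumption that the Riak 1.2.0 cluster commands stage changes until they are committed.

```shell
# Path to vm.args, as seen in the console output above (override if needed).
VM_ARGS="${VM_ARGS:-/etc/riak/vm.args}"

# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: rename a node using stop / edit / start / force-replace.
rename_riak_node() {
    old_name="$1"
    new_name="$2"

    # 1. Stop Riak before touching vm.args.
    run riak stop || return 1
    # Rewrite the -name line while the node is down (GNU sed assumed).
    run sed -i "s/^-name .*/-name ${new_name}/" "$VM_ARGS" || return 1
    # 2. Start Riak with the new name (a plain start, not a restart).
    run riak start || return 1
    # 3. Stage the replacement of the old name with the new one.
    run riak-admin cluster force-replace "$old_name" "$new_name" || return 1
    # Assumption: staged changes must be reviewed and committed to take effect.
    run riak-admin cluster plan || return 1
    run riak-admin cluster commit
}
```

Setting DRY_RUN=1 before calling rename_riak_node riak@10.0.0.1 riak@127.0.0.1 prints the command sequence without touching the node, which is a cheap way to check the ordering before trying it on a real cluster.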

Cookie changes

All nodes in the cluster must have the same cookie value in order to communicate. If any node in the cluster has a different value, it will not be able to communicate with the other nodes in the cluster. It will continue to accept requests but will act as if the rest of the cluster is down. If all nodes in the cluster are updated to the same cookie and restarted, the cluster will eventually return to normal operation.

As with the name change, I don't believe a restart will work. The node will have to be stopped before changing vm.args and started after changing vm.args.
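
A cookie change following the same stop, edit, start order might look like the sketch below. The function names update_cookie_line and change_riak_cookie are hypothetical, the vm.args path is the one from the console output above, and the cookie is assumed to be a plain alphanumeric string (anything containing / or & would need escaping for sed). It would have to be run on every node in the cluster with the same cookie value.

```shell
# Hypothetical helper: rewrite the -setcookie line of a vm.args file in place.
# Uses a temp file plus cat so the original file's owner and mode are kept.
update_cookie_line() {
    file="$1"
    cookie="$2"
    tmp="$(mktemp)" || return 1
    sed "s/^-setcookie .*/-setcookie ${cookie}/" "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Hypothetical helper: stop the node, change the cookie, then start it again.
change_riak_cookie() {
    riak stop || return 1
    update_cookie_line "${VM_ARGS:-/etc/riak/vm.args}" "$1" || return 1
    riak start
}
```

Keeping the file edit in its own function makes it possible to try it against a copy of vm.args before running the full stop and start sequence on a live node.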

Closing for now, as this is expected behavior and not necessarily germane to what Chef needs to handle.