Chef puts the node in a bad state if riak cookie name/address change
Closed this issue · 2 comments
Should node['riak']['args']['-name'] change between runs, Chef will put the node in a bad state. To reproduce, replace #{node['ipaddress']} with 127.0.0.1 in attributes/default.rb.
[Fri, 24 Aug 2012 14:18:17 -0400] ERROR: service[riak] (riak::default line 155) has had an error
[Fri, 24 Aug 2012 14:18:17 -0400] ERROR: Running exception handlers
[Fri, 24 Aug 2012 14:18:17 -0400] ERROR: Exception handlers complete
[Fri, 24 Aug 2012 14:18:17 -0400] FATAL: Stacktrace dumped to /tmp/vagrant-chef-1/chef-stacktrace.out
[Fri, 24 Aug 2012 14:18:17 -0400] FATAL: Chef::Exceptions::Exec: service[riak] (riak::default line 155) had an error: /sbin/service riak restart returned 1, expected 0
Chef never successfully completed! Any errors should be visible in the
output above. Please fix your recipes so that they properly complete.
Erlang doesn't die during the restart:
ps -u riak
PID TTY TIME CMD
2366 ? 00:00:00 epmd
2588 ? 00:00:00 run_erl
2589 pts/0 00:00:19 beam
2715 ? 00:00:00 sh
2716 ? 00:00:00 memsup
2717 ? 00:00:00 cpu_sup
The riak console gives this error:
sudo riak console
Attempting to restart script through sudo -H -u riak
Exec: /usr/lib64/riak/erts-5.9.1/bin/erlexec -boot /usr/lib64/riak/releases/1.2.0/riak -embedded -config /etc/riak/app.config -pa /usr/lib64/riak/basho-patches -args_file /etc/riak/vm.args -- console
Root: /usr/lib64/riak
{error_logger,{{2012,8,24},{14,20,55}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1,[{file,"inet_tcp_dist.erl"},{line,70}]},{net_kernel,start_protos,4,[{file,"net_kernel.erl"},{line,1314}]},{net_kernel,start_protos,3,[{file,"net_kernel.erl"},{line,1307}]},{net_kernel,init_node,2,[{file,"net_kernel.erl"},{line,1197}]},{net_kernel,init,1,[{file,"net_kernel.erl"},{line,357}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}]}
{error_logger,{{2012,8,24},{14,20,55}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.195>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,24},{reductions,681}],[]]}
{error_logger,{{2012,8,24},{14,20,55}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['riak@127.0.0.1',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2012,8,24},{14,20,55}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2012,8,24},{14,20,55}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
This is definitely an issue. However, using riak console isn't an appropriate way to test it. The error message from riak console just indicates that Riak is still running; the error message is expected and correct. I don't know specifically why riak restart returned a non-zero exit code, but changing the name or cookie of a Riak node needs to be done carefully.
Name changes
The name of a Riak node is its identifier in the cluster, and all cluster operations depend on the name of a node never changing. Prior to Riak 1.2.0 there was a riak-admin reip command that could be used in a name change scenario, but it didn't do the right thing. Riak 1.2.0 introduced new cluster operations that do the right thing.
In the case where the name changes I believe the correct behavior is the following:
- Stop Riak before updating vm.args
- Start Riak after updating vm.args (I don't think a restart works correctly with name changes but this should be tested)
- Force replace the old name with the new name
riak-admin cluster force-replace <old_name> <new_name>
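The steps above could be sketched in Ruby roughly as follows. This is only an illustration of the ordering, not something the cookbook does today: rename_riak_node and its run: parameter are hypothetical names, the vm.args path is taken from the console output earlier in this issue, and the sketch assumes vm.args contains a single "-name" line.

```ruby
# Hypothetical helper illustrating the stop / edit / start / force-replace order.
# `run` is injectable so the sequence can be exercised without a real Riak node.
def rename_riak_node(old_name, new_name,
                     vm_args_path: '/etc/riak/vm.args',
                     run: ->(cmd) { system(cmd) || raise("command failed: #{cmd}") })
  # 1. Stop Riak before touching vm.args.
  run.call('riak stop')

  # 2. Rewrite only the value of the -name line; other lines are left alone.
  contents = File.read(vm_args_path)
  File.write(vm_args_path, contents.sub(/^-name[ \t]+\S+/, "-name #{new_name}"))

  # 3. Start (not restart) Riak with the new name.
  run.call('riak start')

  # 4. Tell the cluster that the old name has been replaced by the new one.
  run.call("riak-admin cluster force-replace #{old_name} #{new_name}")
end
```

As far as I know, cluster changes in Riak 1.2 are staged, so a riak-admin cluster plan and riak-admin cluster commit would probably be needed afterwards; that part is outside the steps listed above and is left out of the sketch.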
Cookie changes
All nodes in the cluster must have the same cookie value in order to communicate. If any node in the cluster has a different value, it will not be able to communicate with the other nodes in the cluster. It will continue to accept requests but will act as if the rest of the cluster is down. If all nodes in the cluster are updated to the same cookie and restarted, the cluster will eventually return to normal operation.
As with the name change, I don't believe a restart will work. The node will have to be stopped before changing vm.args and started after changing vm.args.
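One way a recipe could decide between a plain restart and a full stop/start is to compare the -name and -setcookie values in the current and the desired vm.args. The following is a minimal sketch of that check; identity_flags and stop_start_required? are hypothetical names, not part of the cookbook.

```ruby
# Extract the values of the -name and -setcookie flags from vm.args text.
def identity_flags(vm_args_text)
  flags = {}
  vm_args_text.each_line do |line|
    match = line.match(/^(-name|-setcookie)[ \t]+(\S+)/)
    flags[match[1]] = match[2] if match
  end
  flags
end

# True when the node name or cookie differs, i.e. when Riak should be
# stopped before vm.args is rewritten and started afterwards.
def stop_start_required?(current_text, desired_text)
  identity_flags(current_text) != identity_flags(desired_text)
end
```

For example, changing only the cookie between two otherwise identical files would make stop_start_required? return true, while changing an unrelated flag would not.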
Closing for now as this is expected behavior and not necessarily germane to what Chef needs to handle.