edelight/chef-mongodb

mongdb::replicaset fails: config_replicaset silently fails when mongo server not ready yet

thomas-riccardi opened this issue · 2 comments

chef-mongodb v0.16.2

With include_recipe "mongodb::replicaset":

...
==> default: [2015-03-06T14:38:14+00:00] INFO: template[/etc/init.d/mongod] not queuing delayed action restart on service[mongod] (delayed), as it's already been queued
==> default:   * service[mongod] action enable
==> default: [2015-03-06T14:38:14+00:00] INFO: Processing service[mongod] action enable (mongodb::replicaset line 202)
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers for generic service resource enabled on node include: [Chef::Provider::Service::Init, Chef::Provider::Service::Redhat, Chef::Provider::Service::Upstart]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers that refused resource service[mongod] were: [Chef::Provider::Service::Upstart]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers that support resource service[mongod] include: [Chef::Provider::Service::Init, Chef::Provider::Service::Redhat]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers that survived replacement include: [Chef::Provider::Service::Redhat]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: service[mongod] supports status, running
==> default: [2015-03-06T14:38:14+00:00] DEBUG: service[mongod] already enabled - nothing to do
==> default:  (up to date)
==> default:   * service[mongod] action start
==> default: [2015-03-06T14:38:14+00:00] INFO: Processing service[mongod] action start (mongodb::replicaset line 202)
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers for generic service resource enabled on node include: [Chef::Provider::Service::Init, Chef::Provider::Service::Redhat, Chef::Provider::Service::Upstart]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers that refused resource service[mongod] were: [Chef::Provider::Service::Upstart]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers that support resource service[mongod] include: [Chef::Provider::Service::Init, Chef::Provider::Service::Redhat]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: providers that survived replacement include: [Chef::Provider::Service::Redhat]
==> default: [2015-03-06T14:38:14+00:00] DEBUG: service[mongod] supports status, running
==> default: [2015-03-06T14:38:25+00:00] INFO: service[mongod] started
==> default:
==> default:     - start service service[mongod]
==> default: [2015-03-06T14:38:25+00:00] INFO: service[mongod] sending create action to ruby_block[config_replicaset] (immediate)
==> default:   * ruby_block[config_replicaset] action create[2015-03-06T14:38:25+00:00] INFO: Processing ruby_block[config_replicaset] action create (mongodb::replicaset line 225)
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers for generic ruby_block resource enabled on node include: [Chef::Provider::RubyBlock]
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers that refused resource ruby_block[config_replicaset] were: []
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers that support resource ruby_block[config_replicaset] include: [Chef::Provider::RubyBlock]
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers that survived replacement include: [Chef::Provider::RubyBlock]
==> default: [2015-03-06T14:38:25+00:00] WARN: Cannot search for member nodes with chef-solo, defaulting to single node replica set
==> default: [2015-03-06T14:38:25+00:00] INFO: Configuring replicaset with members vm-riccardi-ses-chef
==> default: [2015-03-06T14:38:25+00:00] ERROR: Failed to configure replicaset, reason: {"startupStatus"=>1, "info"=>"indexers", "ok"=>0.0, "errmsg"=>"all members and seeds must be reachable to initiate set"}
==> default: [2015-03-06T14:38:25+00:00] INFO: ruby_block[config_replicaset] called
==> default:     - execute the ruby block config_replicaset
==> default:
==> default:
==> default:
==> default: * ruby_block[config_replicaset] action nothing[2015-03-06T14:38:25+00:00] INFO: Processing ruby_block[config_replicaset] action nothing (mongodb::replicaset line 225)
==> default:  (skipped due to action :nothing)
==> default: [2015-03-06T14:38:25+00:00] DEBUG: Skipping ruby_block[config_replicaset] due to action :nothing
==> default:
==> default: * ruby_block[run_config_replicaset] action run
==> default: [2015-03-06T14:38:25+00:00] INFO: Processing ruby_block[run_config_replicaset] action run (mongodb::replicaset line 232)
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers for generic ruby_block resource enabled on node include: [Chef::Provider::RubyBlock]
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers that refused resource ruby_block[run_config_replicaset] were: []
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers that support resource ruby_block[run_config_replicaset] include: [Chef::Provider::RubyBlock]
==> default: [2015-03-06T14:38:25+00:00] DEBUG: providers that survived replacement include: [Chef::Provider::RubyBlock]
==> default: [2015-03-06T14:38:25+00:00] INFO: ruby_block[run_config_replicaset] called
==> default:
==> default:     - execute the ruby block run_config_replicaset
==> default:
==> default:
...

What happens?
When creating the service it is enabled and started, and immediately after that the config_replicaset is run (since PR #352, but the issue could still happen without this merge).
However the first service start takes time to pre-allocate the large files on disk. So the service is started, but it's not ready to accept requests yet.
The config_replicaset fails because the server is not ready yet. It logs an error, but the recipe continues to run!

Sub-issue 1:

Why does the recipe continue to run?

Sub-issue 2:

Why does the recipe re-configures the replica set via the additional run_config_replicaset block (in addition to the notify :immediate from the service) ?

Sub-issue 3:

Why does the service start finishes before the server is really started (ie ready to accept requests).

For sub-issue 1 it seems all functions in libraries/ just log error or warnings, but don't return/raise an error...

How to fix those issues?

Sub-issue 1:

How should we return an error from the functions in libraries/mongodb.rb ? And in definitions/mongodb.rb ? Throw an exception (which one?) ? Call Chef::Application.fatal ?

Sub-issue 3:

Should we bug-report to MongoDB Inc.?
Or should we hook into Chef services providers to add a wait loop?
Or should we notify :immediately an extra ruby block that waits?
Or should we add retry loop on libraries/mongodb.rb? (on connection, and for replica-set until replica set ready?)
And/or should we expose a function to wait for server ready?