A distributed executor service in Ruby for running many tasks concurrently across multiple machines (physical hosts or VMs). Use Hive when you have a batch of tasks that is too heavy for a single machine: it distributes the tasks seamlessly across multiple nodes and lets you retrieve the results as if everything were running on a single machine.
You need to clone the git repo and build the gem locally:
$ git clone https://github.com/ngrichyj4/hive-io && cd hive-io
$ gem build hive-io.gemspec
And then execute:
$ gem install hive-io-version.gem # use appropriate version
Require `hive` in your project and set up the worker nodes.
The executor service provides an interface for distributing tasks to multiple workers. Each task will be assigned to a node in the network for execution.
NOTE: Each task is executed outside the current process and must be atomic, i.e. self-contained: it must not depend on any variables, methods, classes, or other objects from the current process.
# consumer.rb
require 'hive'

Hive::Executor::Service.boot #> Start the executor service (default 127.0.0.1:7556) for worker nodes to connect to

class Consumer
  attr_accessor :service

  def initialize
    @service = Hive::Executor::Service.create(
      node: {
        trigger: self #> an instance of an object that responds to #perform to receive node results
      }
    )
    get!
  end

  def get!
    urls = ['http://www.example1.com', 'http://www.example2.com', 'http://www.example3.com']
    urls.map do |uri|
      @service.execute! do
        #> Each worker will be assigned a uri to open
        Hive::Task.new %{
          uri = "#{uri}"
          open(uri).read
        } #> Atomic
      end
    end
  end

  # Will be triggered by the executor with each result
  def perform(arg)
    p arg.result
  end
end
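With worker nodes connected, everything above is driven by a single call: loading `consumer.rb` boots the service, and constructing the consumer dispatches the tasks.

Consumer.new #> the constructor calls get!, which hands the three tasks to the executor service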
In the above `get!` example each worker will be sent the string `open('http://www.example(i).com').read`. This string will be evaluated with `eval` by the worker and the result will be returned. The task is atomic because the `uri` variable was interpolated in the main process and passed along inside the string.
The incorrect way to write `get!` would have been:
def get!
  urls = ['http://www.example1.com', 'http://www.example2.com', 'http://www.example3.com']
  urls.map do |uri|
    @service.execute! do
      Hive::Task.new %{
        open(uri).read
      } #> Wrong
    end
  end
end
In this example, the string `open(uri).read` will be sent instead. The worker does not know what `uri` is, and you would get `NameError: undefined local variable or method 'uri' for ...`
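As a rule of thumb, interpolate every value the task needs into the string before shipping it off. Here is a minimal sketch (not from the library itself) that uses `inspect` to turn a Ruby value into a literal the worker can safely `eval`; `read_timeout` is a standard open-uri option assumed to be available on the worker:

timeout = 5
uri     = 'http://www.example1.com'
Hive::Task.new %{
  uri     = #{uri.inspect}  #> becomes the literal "http://www.example1.com"
  timeout = #{timeout}      #> becomes the literal 5
  open(uri, read_timeout: timeout).read
} #> Atomic: both values were baked in by the main process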
You can also specify a custom `ipaddr:port` for the executor service to bind to:
Hive::Executor::Service.boot ipaddr: '0.0.0.0', port: '7556' #> Default is 127.0.0.1:7556
To check if the executor service is running:
$ lsof -i :7556
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ruby 16443 ngrichyj4 11u IPv4 0x59aa58819e8169df 0t0 TCP localhost:7556 (LISTEN)
To stop the executor service you can either call `Hive::Executor::Service.stop!` or from the terminal run:
$ kill -9 [PID] # where PID is the process id.
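If you prefer a clean shutdown over `kill -9`, one sketch is to trap SIGINT in the process that booted the service, assuming the `stop!` call shown above:

Signal.trap('INT') do
  Hive::Executor::Service.stop! #> assumed shutdown call, per the note above
  exit
end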
Configure the worker node with the IP address of the machine where the executor service is running (i.e. your project). The worker node will make multiple attempts to connect to the executor until a connection has been established. The worker will then receive and execute tasks from the executor service and send back the result(s).
IMPORTANT: You need to set up the worker node to include all the libraries that will be required during the execution of your task.
# worker.rb
# Prepare the worker to include all gems that will be needed by the executor service.
# You can use bundler to include the gems from your Gemfile.
require 'rubygems'
require 'hive'
require 'bundler/setup'
Bundler.require(:default)
require 'open-uri' #> required since it is used by the executor service's task

class Worker
  include Hive::Worker
end

Worker.run! if __FILE__ == $0
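Since the worker loads gems through Bundler, give each worker machine a Gemfile listing every third-party gem your tasks touch (open-uri ships with Ruby's standard library, so it needs no entry). A minimal sketch; the `path:` entry is an assumption for a locally built gem:

# Gemfile (worker node)
source 'https://rubygems.org'

gem 'hive-io', path: '/path/to/hive-io' #> assumption: locally built gem, adjust to your install
# ...plus any gems your tasks need at execution time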
You can then start multiple instances of the worker on one or more machines. Tip: use a Docker swarm or Kubernetes :)
ruby worker.rb --master 127.0.0.1:7556 --threadpool 10 # Worker will continuously poll the executor until a connection is established

The `threadpool` option specifies how many tasks each worker node executes concurrently.
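For example, here is a quick sketch for spinning up a handful of workers on one machine straight from Ruby, using the flags documented above:

# Launch 4 worker processes against the local executor service
4.times do
  spawn('ruby', 'worker.rb', '--master', '127.0.0.1:7556', '--threadpool', '10')
end
Process.waitall #> block until all workers exit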
Ideally you would deploy the worker node on thousands of containers across multiple machines using a Docker swarm, and enjoy process monitoring with automatic restarts if a worker crashes. I've included a simple example that uses `docker-compose` with `docker` to spawn multiple workers on a single machine; see examples. Or you can use a Docker swarm instead for a multi-node cluster.
You need to have `docker` and `docker-compose` installed on your machine. If you're using Ubuntu you can use the installation scripts found in scripts.
Clone the project if you haven't already and, from the examples directory, change the `MASTER_NODE` entry in `docker-compose.yml` to the IP address of your executor node. If you're testing locally on Linux use `127.0.0.1`; on Mac OSX use your local IP address instead, e.g. `192.168.1.88`.
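For reference, the relevant fragment of `docker-compose.yml` might look like this sketch (only `MASTER_NODE` and the `worker` service name come from the example; the rest is assumed):

services:
  worker:
    build: .
    environment:
      MASTER_NODE: 192.168.1.88:7556 # point this at your executor node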
IMPORTANT: If you are testing locally on Mac OSX, you need to bind the executor service to `0.0.0.0`. If not, the workers inside the Docker containers will not be able to connect to it.
Then execute the following commands in the examples directory:
$ docker-compose build
$ docker-compose up -d --scale worker=100
$ docker-compose logs -f #> to view logs
You should now have 100 workers ready for use from your executor service.
- Fork it ( https://github.com/ngrichyj4/hive-io/fork )
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request