Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. Ansible is a radically simple IT automation system. With the automated setup in this repository, you can easily install Horovod on every server, either on AWS or on-premise.
- All on-premise nodes should run Ubuntu >= 16.04. For the on-premise case, I assume that all nodes are already set up for Ansible.
- So far, only the TensorFlow and PyTorch examples can be used (not MXNet, Caffe, etc. yet).
AWS
Step : 0 - 1 - 3
On-premise
Step : 0 - 2 - 3
All steps are carried out inside a Docker container to keep things simple for beginners.
$ docker run -it --name horovod-ansible graykode/horovod-ansible:0.1 /bin/bash
To create the Horovod cluster environment, start provisioning with the Terraform code. Change whichever options you want in variables.tf, but do not change anything below the line ## DO NOT CHANGE BELOW.
If the EC2 instances are created with the option number_of_worker set to 3, the overall architecture looks like the picture below.
$ export AWS_ACCESS_KEY_ID=<Your Access Key in AWS>
$ export AWS_SECRET_ACCESS_KEY=<Your Secret Access Key in AWS>
Initialize Terraform and create the private key to use.
$ cd terraform/ && ssh-keygen -t rsa -N "" -f horovod
$ terraform init
Provision all resources: EC2 and the VPC (gateway, router, subnet, etc.).
$ terraform apply
Then you get output like this:
Apply complete! Resources: 12 added, 0 changed, 0 destroyed.
Outputs:
horovod_master_public_ip = <master's public IP>
horovod_workers_public_ip = <worker0's public IP>,<worker1's public IP>
As stated above, this step assumes that all nodes are set up for Ansible and that the network setup is finished. If you want to see how to install Ansible, please read the Ansible Install Guide in the documentation.
Install ansible and jinja2 using pip.
$ cd ../ansible && pip install -r requirements.txt
Set inventory.ini in the ansible folder.
master ansible_host=<master's public IP>
worker0 ansible_host=<worker0's public IP>
worker1 ansible_host=<worker1's public IP>
....
worker[n] ansible_host=
[all]
master
worker0
worker1
...
worker[n]
[master-servers]
master
[worker-servers]
worker0
worker1
...
worker[n]
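Writing the inventory by hand gets tedious as the number of workers grows. As a minimal sketch (not part of this repository), the Python below renders the same layout from a master IP and a comma-separated worker list in the form printed by terraform apply. The function names build_inventory and parse_worker_ips and the example addresses are assumptions for illustration only.

```python
def parse_worker_ips(csv):
    """Split a comma-separated IP string, as in horovod_workers_public_ip."""
    return [ip.strip() for ip in csv.split(",") if ip.strip()]


def build_inventory(master_ip, worker_ips):
    """Render inventory.ini text in the layout shown above."""
    workers = [f"worker{i}" for i in range(len(worker_ips))]

    # Host definitions: one line per node with its reachable address.
    lines = [f"master ansible_host={master_ip}"]
    lines += [f"{name} ansible_host={ip}" for name, ip in zip(workers, worker_ips)]

    # Groups used by the playbooks: all nodes, the master, and the workers.
    lines += ["", "[all]", "master", *workers]
    lines += ["", "[master-servers]", "master"]
    lines += ["", "[worker-servers]", *workers]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder addresses; replace them with your own outputs.
    workers = parse_worker_ips("203.0.113.11,203.0.113.12")
    print(build_inventory("203.0.113.10", workers), end="")
```

If this were saved under a hypothetical name such as make_inventory.py, its output could be redirected into inventory.ini.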
Run a ping test against all nodes.
$ chmod +x ping.sh && ./ping.sh
Now configure SSH for Open MPI, then download and build Open MPI.
$ chmod +x playbook.sh && ./playbook.sh
Test from the master node that MPI works across all nodes.
$ chmod +x test.sh && ./test.sh
# go to master node.
ubuntu@master:~$ mpirun -np 3 -mca btl sm,self,tcp -host master,worker0,worker1 ./test
Processor name: master
master (0/3)
Processor name: worker0
slave (1/3)
Processor name: worker1
slave (2/3)
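In the mpirun call above, the -np value has to match the number of hosts passed to -host when each node runs one process. As a rough sketch, assuming the master/worker0/worker1 naming used in inventory.ini, the Python below builds that command for any number of workers. The helper names host_names and mpirun_command are hypothetical and not part of this repository.

```python
import shlex


def host_names(num_workers):
    """Return the node names in the order used above: master, then workers."""
    return ["master"] + [f"worker{i}" for i in range(num_workers)]


def mpirun_command(num_workers, program="./test"):
    """Build the mpirun argument list with one process per node."""
    hosts = host_names(num_workers)
    return [
        "mpirun",
        "-np", str(len(hosts)),        # one process per host
        "-mca", "btl", "sm,self,tcp",  # same transport options as above
        "-host", ",".join(hosts),
        program,
    ]


if __name__ == "__main__":
    # Two workers plus the master reproduces the command shown above.
    print(shlex.join(mpirun_command(2)))
    # prints: mpirun -np 3 -mca btl sm,self,tcp -host master,worker0,worker1 ./test
```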
Feel free to adapt this part to your own needs.
- Install TensorFlow on CPU and Horovod, then run distributed training
$ chmod +x tensorflow.sh && ./tensorflow.sh
# go to master node.
ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 tensorflow-train.py
- Install PyTorch on CPU and Horovod, then run distributed training
$ chmod +x pytorch.sh && ./pytorch.sh
# go to master node.
ubuntu@master:~$ horovodrun -np 3 -H master,worker0,worker1 python3 pytorch-train.py
- Issue note: if you want to change the framework after Horovod is installed, you have to reinstall Horovod with the HOROVOD_WITH_* option, where '*' is the framework name. Please see the Horovod issue. My Ansible script does not add this option yet.
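Since the Ansible script does not handle this yet, here is a hedged sketch of what a manual reinstall could look like. It relies on Horovod's HOROVOD_WITH_TENSORFLOW, HOROVOD_WITH_PYTORCH, and HOROVOD_WITH_MXNET build variables; the helper names reinstall_plan and run_plan are hypothetical, and the exact pip invocation may need adjusting for your nodes.

```python
import os
import subprocess

# Horovod build flags that force support for a given framework.
FLAGS = {
    "tensorflow": "HOROVOD_WITH_TENSORFLOW",
    "pytorch": "HOROVOD_WITH_PYTORCH",
    "mxnet": "HOROVOD_WITH_MXNET",
}


def reinstall_plan(frameworks):
    """Return (extra environment, commands) for reinstalling Horovod."""
    unknown = [f for f in frameworks if f not in FLAGS]
    if unknown:
        raise ValueError(f"unsupported framework(s): {unknown}")
    env = {FLAGS[f]: "1" for f in frameworks}
    commands = [
        ["pip", "uninstall", "-y", "horovod"],
        # --no-cache-dir avoids reusing a wheel built for another framework.
        ["pip", "install", "--no-cache-dir", "horovod"],
    ]
    return env, commands


def run_plan(env, commands):
    """Execute the plan on the current node (not called automatically)."""
    for cmd in commands:
        subprocess.run(cmd, env={**os.environ, **env}, check=True)


if __name__ == "__main__":
    env, commands = reinstall_plan(["pytorch"])
    print(env)
    for cmd in commands:
        print(" ".join(cmd))
```

The plan is only printed here; calling run_plan(env, commands) on each node would actually perform the reinstall.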
- Tae Hwan Jung(Jeff Jung) @graykode
- Author Email : nlkey2022@gmail.com