Uber Horovod Testing
Closed this issue · 6 comments
Currently we need a Horovod image in order to test on CPU.
@Tomcli I'm interested in this as well. Currently, I get
Deploying model with manifest 'manifest_tfmnist.yml' and model files in './horovod'...
FAILED
Error: tensorflow version 1.5-py3-horovod not supported.
FAILED
Error 200: OK
So I need the Horovod image. Any chance you could provide steps to reproduce your working CPU version of Horovod?
Thanks in advance.
- The horovod example in #79 is at https://github.com/fplk/FfDL/blob/merge_20180514_1536/etc/examples/horovod/manifest_tfmnist.yml
- A working Horovod image is available at https://github.ibm.com/Tommy-Chaoping-Li/dlaas-horovod. Using our own images requires some core changes to how we handle dependencies and train.sh, so ultimately I want to remove train.sh and replace it with something similar to launcher.py in native TensorFlow Distributed Learning (work in progress).
- With the image and example above, you should be able to run Horovod TensorFlow on both CPU and GPU.
- Horovod PyTorch is working on CPU; GPU is still under testing.
Have you solved this problem?
Hi @fzuwill, this issue was solved with #104. Our Horovod examples are available at https://github.com/IBM/FfDL/tree/master/etc/examples/horovod . If you want to use a later version of Horovod, simply change the framework version section of the manifest to any tag that is available on Horovod's DockerHub.
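For anyone who wants to script that version change, here is a rough sketch in Python. It assumes the manifest has a top-level `framework:` section containing a `version:` key, which is what the error message earlier in this thread suggests. The sample manifest, its `name` and `command` values, the helper name `set_framework_version`, and the replacement tag are all hypothetical, so check them against the real example manifests and the tags actually published on Horovod's DockerHub.

```python
import re

# Hypothetical manifest fragment for illustration only; the real example
# manifests live in the FfDL repository and may contain other fields.
SAMPLE_MANIFEST = """\
name: horovod-mnist
framework:
  name: tensorflow
  version: "1.5-py3-horovod"
  command: python tensorflow_mnist.py
"""


def set_framework_version(manifest_text, new_version):
    """Return manifest_text with the `version` under `framework:` replaced.

    Assumes `framework:` is a top-level key and `version:` is indented
    beneath it. Raises ValueError if no such entry is found.
    """
    out = []
    in_framework = False
    replaced = False
    for line in manifest_text.splitlines():
        if line and not line[0].isspace():
            # A new top-level key starts; note whether it is `framework:`.
            in_framework = line.strip().startswith("framework:")
        elif in_framework and not replaced and re.match(r"\s+version\s*:", line):
            # Keep the original indentation and swap in the new tag.
            indent = line[: len(line) - len(line.lstrip())]
            line = f'{indent}version: "{new_version}"'
            replaced = True
        out.append(line)
    if not replaced:
        raise ValueError("no framework.version entry found")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # "my-horovod-tag" is a placeholder, not a real DockerHub tag.
    print(set_framework_version(SAMPLE_MANIFEST, "my-horovod-tag"))
```

This edits the text line by line rather than parsing YAML, so it needs no extra packages and leaves the rest of the manifest untouched. A YAML library would be more robust for manifests with unusual layouts.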