IBM/FfDL

Uber Horovod Testing

Closed this issue · 6 comments

Currently we need an image to test on CPU

@Tomcli, for the Horovod testing we are doing in PR #79, which base image are we pulling for Horovod? cc @alsrgv

fplk commented

@Tomcli I'm interested in this as well. Currently, I get

Deploying model with manifest 'manifest_tfmnist.yml' and model files in './horovod'...
FAILED
Error: tensorflow version 1.5-py3-horovod not supported.

FAILED
Error 200: OK

So I need the Horovod image. Any chance you could provide steps to reproduce your working CPU version of Horovod?

Thanks in advance.

@Tomcli, can you please give an update on the following:

  1. Point to the Horovod TensorFlow example in #79
  2. Update on Horovod TensorFlow testing on CPU+GPU
  3. Update on Horovod PyTorch testing on GPU
  1. The Horovod example in #79 is at https://github.com/fplk/FfDL/blob/merge_20180514_1536/etc/examples/horovod/manifest_tfmnist.yml
  • A working Horovod image is available at https://github.ibm.com/Tommy-Chaoping-Li/dlaas-horovod. Using our own images will require some core changes to how we handle dependencies and train.sh. Ultimately, I want to remove train.sh and replace it with something similar to launcher.py in native TensorFlow distributed learning (work in progress).
  2. Using the image and example above, you should be able to run Horovod TensorFlow on both CPU and GPU.
  3. Horovod PyTorch is working on CPU; GPU is still under testing.

> @Tomcli I'm interested in this as well. Currently, I get
>
> Deploying model with manifest 'manifest_tfmnist.yml' and model files in './horovod'...
> FAILED
> Error: tensorflow version 1.5-py3-horovod not supported.
>
> FAILED
> Error 200: OK
>
> So I need the Horovod image. Any chance you could provide steps to reproduce your working CPU version of Horovod?
>
> Thanks in advance.

Have you solved this problem?

Hi @fzuwill, this issue was solved with #104. Our Horovod examples are available at https://github.com/IBM/FfDL/tree/master/etc/examples/horovod . If you want to use a later version of Horovod, simply change the framework version section of the manifest to any tag that is available on Horovod's DockerHub.
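
For illustration only, the change described above might look roughly like the fragment below. This is a sketch, not the actual contents of the example manifests: the `framework` block with `name` and `version` keys is assumed from the error message earlier in this thread (which reports a framework name and version), the `name` value is a guess, and the version is a placeholder to be replaced with a real tag from Horovod's DockerHub. Check the manifests in the examples directory for the exact layout.

```yaml
# Hypothetical manifest fragment; key names and the `name` value are assumptions.
framework:
  name: horovod                          # assumed framework name
  version: "<tag-from-horovod-dockerhub>" # placeholder: substitute a published Horovod image tag
```

If the deployment still fails with a "version ... not supported" error like the one quoted above, the tag is probably not one the installed FfDL version recognizes.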