automl/HpBandSter

Distributed training/optimization using Horovod

Nicholas-Autio-Mitchell opened this issue · 0 comments

I am interested in using HpBandSter in a distributed fashion, using Horovod. Horovod essentially provides efficient communication between GPUs, primarily for data parallelism, i.e. training a single model with mini-batches distributed across multiple GPUs on a single machine or across multiple nodes.
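
For reference, this is roughly the kind of data-parallel training I mean: a minimal sketch using Horovod's `horovod.tensorflow.keras` binding, where the model and the random data are just placeholders.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each Horovod process to one GPU (one process per GPU).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder data and model just to make the sketch self-contained.
x_train = np.random.rand(1024, 32).astype('float32')
y_train = np.random.rand(1024, 1).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1),
])

# Wrap the optimizer so gradients are averaged across all processes,
# and scale the learning rate with the number of workers.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss='mse', optimizer=opt)

callbacks = [
    # Make sure all processes start from the same initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(x_train, y_train, batch_size=64, epochs=5,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```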

Do you have any experience with this, or have you seen any examples?

Otherwise, do you know of any examples of using multiple GPUs for Keras (or TensorFlow) models? The documentation talks about workers, but it isn't immediately clear whether a worker means a single GPU or a group of GPUs, e.g. on a server with 8 GPUs, running two workers with 4 GPUs each.
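
To illustrate what I imagine by "two workers with 4 GPUs each": each worker process would restrict itself to its own slice of the GPUs via `CUDA_VISIBLE_DEVICES` before building its Keras model. This is only a sketch of my assumption, not something I found in the HpBandSter docs; `MultiGPUKerasWorker` and the `gpu_group` argument are my own names, and the hyperparameter keys (`num_units`, `lr`) are placeholders.

```python
import os
import numpy as np
from hpbandster.core.worker import Worker

class MultiGPUKerasWorker(Worker):
    def __init__(self, gpu_group, **kwargs):
        super().__init__(**kwargs)
        # Restrict this worker process to its own slice of the GPUs,
        # e.g. "0,1,2,3" for worker 1 and "4,5,6,7" for worker 2.
        os.environ['CUDA_VISIBLE_DEVICES'] = gpu_group

    def compute(self, config, budget, working_directory, *args, **kwargs):
        # Import TF only after CUDA_VISIBLE_DEVICES is set, so the process
        # never claims GPUs outside its group.
        import tensorflow as tf

        # Placeholder data and model; in practice the 4 visible GPUs could
        # be used here (e.g. via Horovod or in-process multi-GPU training).
        x = np.random.rand(256, 32).astype('float32')
        y = np.random.rand(256, 1).astype('float32')
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(config['num_units'], activation='relu'),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(config['lr']), loss='mse')
        hist = model.fit(x, y, epochs=int(budget), verbose=0)
        return {'loss': float(hist.history['loss'][-1]), 'info': {}}
```

The two worker processes on an 8-GPU server would then be started with `gpu_group="0,1,2,3"` and `gpu_group="4,5,6,7"` respectively. Is that the intended way to use workers, or does a worker always correspond to a single GPU?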