facebookincubator/gloo

Allgatherv?

tgaddair opened this issue · 8 comments

Hey @pietern,

I've been exploring adding support for Gloo as an alternative to MPI in Horovod. For our allgather implementation, we make use of MPI's Allgatherv function to gather varying amounts of data from each rank.
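For reference, our usage looks roughly like the sketch below (simplified, with illustrative buffer names rather than Horovod's actual code). The point is that each rank contributes a different count, so the receive side needs per-rank counts and displacements:

```cpp
#include <mpi.h>
#include <vector>

// Gather a variable number of floats from every rank into one buffer.
void allgatherv_example(const std::vector<float>& send_buf, MPI_Comm comm) {
  int size = 0;
  MPI_Comm_size(comm, &size);

  int send_count = static_cast<int>(send_buf.size());

  // Exchange the per-rank counts first so everyone can size recv_buf.
  std::vector<int> recv_counts(size);
  MPI_Allgather(&send_count, 1, MPI_INT, recv_counts.data(), 1, MPI_INT, comm);

  // Compute where each rank's data lands in the receive buffer.
  std::vector<int> displs(size, 0);
  for (int i = 1; i < size; i++) {
    displs[i] = displs[i - 1] + recv_counts[i - 1];
  }

  std::vector<float> recv_buf(displs[size - 1] + recv_counts[size - 1]);
  MPI_Allgatherv(send_buf.data(), send_count, MPI_FLOAT,
                 recv_buf.data(), recv_counts.data(), displs.data(),
                 MPI_FLOAT, comm);
}
```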

Looking at the Gloo API, I don't immediately see any equivalent to Allgatherv. Do you know if this is something that's currently supported, either directly or indirectly?

Thanks.

Hi @tgaddair!

We don't have allgatherv at the moment. I don't think it would be too hard to add, but you could work around it today by doing a max allreduce on the per-rank element count, followed by a normal allgather with every rank's input padded to that maximum. I assume this is only for metadata to bootstrap Horovod? If so, the extra call may be acceptable for you.
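In case it helps, here is a rough, untested sketch of that workaround using the function-style API in gloo/allreduce.h and gloo/allgather.h. The helper name and the max lambda are mine, so treat it as illustrative rather than canonical:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

#include "gloo/allgather.h"
#include "gloo/allreduce.h"
#include "gloo/context.h"

// Gather `input`, whose length may differ per rank, by padding every rank's
// contribution to the largest count. The result has maxCount * size elements;
// callers still need the true per-rank counts to strip the zero padding.
std::vector<float> paddedAllgather(
    const std::shared_ptr<gloo::Context>& context,
    const std::vector<float>& input) {
  // 1) Max-allreduce the local element count to find the largest contribution.
  size_t localCount = input.size();
  size_t maxCount = 0;
  gloo::AllreduceOptions reduceOpts(context);
  reduceOpts.setInput(&localCount, 1);
  reduceOpts.setOutput(&maxCount, 1);
  // The reduction is user-supplied; this one computes an elementwise max.
  reduceOpts.setReduceFunction(
      [](void* c, const void* a, const void* b, size_t n) {
        auto* out = static_cast<size_t*>(c);
        auto* x = static_cast<const size_t*>(a);
        auto* y = static_cast<const size_t*>(b);
        for (size_t i = 0; i < n; i++) {
          out[i] = std::max(x[i], y[i]);
        }
      });
  gloo::allreduce(reduceOpts);

  // 2) Pad the local input with zeros up to maxCount, then a plain allgather.
  std::vector<float> padded(maxCount, 0.0f);
  std::copy(input.begin(), input.end(), padded.begin());

  std::vector<float> output(maxCount * context->size);
  gloo::AllgatherOptions gatherOpts(context);
  gatherOpts.setInput(padded.data(), padded.size());
  gatherOpts.setOutput(output.data(), output.size());
  gloo::allgather(gatherOpts);
  return output;
}
```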

Thanks for the response!

We can definitely implement the workaround you suggested today and see how that performs, then switch over once the feature is fully supported.

For context, we currently use allgatherv for collecting sparse gradients (as an alternative to allreduce for things like training a word embedding model). So long term, an optimized solution would be preferred. But I'll go ahead and start with the approach you suggested.

Thanks!

I see, thanks for the additional information. Allgatherv makes sense to use there. Granted, if the level of sparsity differs wildly between participants, I assume the load on the network will be skewed as well and end up constrained by the largest contribution (when using a simple ring algorithm). If so, having the under-contributing processes send extra zeros shouldn't have a big impact on the overall runtime.

By the way, allgatherv is not planned, but a contribution is of course more than welcome if you'd like to take it on! I think https://github.com/facebookincubator/gloo/blob/master/gloo/allgather.cc is a good starting point. It could be made even simpler if you don't use the chunking (which is done there to try to pipeline sends and receives).
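For a sense of the semantics such a contribution would provide: with the existing functions you also have to exchange the real per-rank counts and compact the padded result yourself, roughly as sketched below; a native allgatherv would fold all of that into a single call. This is untested and the names are mine:

```cpp
#include <memory>
#include <vector>

#include "gloo/allgather.h"
#include "gloo/context.h"

// Turn the padded gather result from above into a densely packed buffer.
std::vector<float> compactPaddedGather(
    const std::shared_ptr<gloo::Context>& context,
    const std::vector<float>& padded,  // maxCount * size elements
    size_t myCount) {                  // this rank's true element count
  // Exchange every rank's true count with a plain allgather.
  std::vector<size_t> counts(context->size);
  gloo::AllgatherOptions opts(context);
  opts.setInput(&myCount, 1);
  opts.setOutput(counts.data(), counts.size());
  gloo::allgather(opts);

  // Copy only the valid prefix of each rank's slot into a dense result.
  const size_t maxCount = padded.size() / context->size;
  std::vector<float> result;
  for (int rank = 0; rank < context->size; rank++) {
    const float* slot = padded.data() + rank * maxCount;
    result.insert(result.end(), slot, slot + counts[rank]);
  }
  return result;
}
```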

Thanks, I'll happily send you a PR if it turns out we need a more optimized solution.

Awesome, thank you. I'll close this issue in the meantime.

@tgaddair FYI, adding allgatherv in #177.

Thanks @pietern! Looking forward to trying it out.

By the way, a while back I took a quick stab at something similar, but based it on the AllgatherRing implementation. Can you explain the relationship between these APIs? Is this API (allgather and allgatherv) an abstraction over the topology-specific approaches like AllgatherRing?

@tgaddair Hey Travis, sorry for the delayed response.

The difference is that the functions can be called without creating an instance for every operation. Back when we first wrote Gloo we only had to support Caffe2, where we could rely on the dimensions of the tensors in the graph never changing. Therefore we could pre-allocate and initialize everything for an operation just once and then run it many times. This is not compatible with PyTorch, where every operation may use different dimensions. The function-based API is a better fit there. Going forward, I would like to deprecate the old class-based algorithms in favor of the new functions. None of this is documented, but it should have been. Thanks for asking the question and bringing up the issue.
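To make the contrast concrete, here is roughly what a call looks like with the function-style API (a sketch, not lifted from the codebase). Nothing is cached between calls, so the same code works even if the sizes change on every iteration:

```cpp
#include <memory>
#include <vector>

#include "gloo/allgather.h"
#include "gloo/context.h"

// With the function-style API there is no per-shape state to keep around:
// build the options, call gloo::allgather, and you are done. The class-based
// algorithms (AllgatherRing and friends) are instead constructed once for
// fixed buffers and dimensions and then have run() invoked repeatedly.
void gatherOnce(
    const std::shared_ptr<gloo::Context>& context,
    std::vector<float>& input) {
  std::vector<float> output(input.size() * context->size);
  gloo::AllgatherOptions opts(context);
  opts.setInput(input.data(), input.size());
  opts.setOutput(output.data(), output.size());
  gloo::allgather(opts);
}
```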