mit-han-lab/distrifuser

Measuring Communication Amount

Closed this issue · 3 comments

Hi, thanks for the great work! I'm wondering how the communication amount in Table 2 of the paper is calculated. Are those calculations available in the evaluation script?

In PatchParallelismCommManager, we print the buffer size on each device (self.numel). You can enable this profiling output by passing verbose=True when initializing the distri_config.

For AllGather, when using ring AllGather, the communication amount is $s \times (n-1) \times 2$ bytes, where $s$ is the buffer size (the number of elements, as printed by self.numel), $n$ is the number of devices, and the factor 2 is the 2 bytes per element at FP16 precision.
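As a quick sanity check, the AllGather formula above can be evaluated with a few lines of Python. This is only a sketch of the arithmetic: the function name `ring_allgather_bytes` and the example buffer size are made up for illustration and are not part of the distrifuser code.

```python
def ring_allgather_bytes(numel: int, n_devices: int, bytes_per_elem: int = 2) -> int:
    """Communication amount of ring AllGather: s * (n - 1) * bytes_per_elem.

    numel          -- buffer size s in elements (e.g. the value printed as self.numel)
    n_devices      -- number of devices n
    bytes_per_elem -- 2 for FP16
    """
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return numel * (n_devices - 1) * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical buffer of 1,000,000 elements on 4 devices.
    amount = ring_allgather_bytes(1_000_000, 4)
    print(f"{amount / 1e6:.1f} MB")
```

With a single device the result is 0, since nothing has to be exchanged.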

For AllReduce, when using ring AllReduce, the communication amount is $s \times \frac{n-1}{n} \times 2 \times 2$ bytes. The first 2 accounts for the two phases of ring AllReduce (reduce-scatter followed by all-gather), and the second 2 is the 2 bytes per element at FP16 precision.

For Tensor Parallelism, our code does not support printing the buffer size $s$ for now. You can calculate it by summing the numel of every AllReduced tensor in attention.py, conv.py, feed_forward.py and resnet.py.
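The AllReduce formula, together with the manual summation suggested for Tensor Parallelism, can be sketched the same way. The function name `ring_allreduce_bytes` and the numel values in the list are hypothetical placeholders; the real values would have to be collected from the AllReduce calls in the files named above.

```python
def ring_allreduce_bytes(numel: int, n_devices: int, bytes_per_elem: int = 2) -> float:
    """Communication amount of ring AllReduce: s * (n - 1) / n * 2 * bytes_per_elem.

    The factor 2 covers the two phases (reduce-scatter, then all-gather).
    """
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return numel * (n_devices - 1) / n_devices * 2 * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical numel of each AllReduced tensor, gathered by hand.
    allreduced_numels = [4_194_304, 4_194_304, 1_048_576]
    s = sum(allreduced_numels)  # total buffer size s for Tensor Parallelism
    amount = ring_allreduce_bytes(s, n_devices=8)
    print(f"s = {s} elements, communication = {amount / 1e6:.2f} MB")
```

Summing the element counts first and applying the formula once gives the same result as applying it per tensor and adding up, because the formula is linear in $s$.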

You can refer to our efficientml.ai slides (page 50) for these communication primitives.