NVlabs/timeloop

multicast factor computation for Input operand in convolution

suyashbakshi opened this issue · 8 comments

For the following spatial mapping, the multicast factor for the Input operand is reported as 14. The Input operand is indexed as
Input[C][P+R-1][Q+S-1][F+T-1]. Is the multicast factor computed using the factors for R and P in this spatial mapping? What about the multicast opportunity created by the factor for K?

for K in [0:2)
   for C in [0:2)
      for R in [0:7)
         for P in [0:8)

Additionally, in the hardware architecture that I am using, I initialized a "simple_multicast" network between the two levels, and a fanout occurs between the parent and the child level. To understand the multicast_factor and the energy computation, I added a std::cout inside this else condition to print the multicast_factors and the network ingresses. In some cases, there are multiple multicast_factors reported for the Input operand (due to the loop that encloses the above-mentioned else condition). However, because the energy computation inside this else condition uses = rather than +=, the energy is computed only from the ingresses for the largest multicast_factor. Why aren't the ingresses for the smaller multicast_factors considered in the energy computation?

Hello, I would greatly appreciate any clarification on this. Specifically, what do the multiple multicast_factors represent, and why are only the ingresses for the largest multicast_factor used for the energy computation in the simple_multicast network?

Thank you

I believe input multicast for those spatial loops would be along (R=7) * (K=2) = 14.

Thanks @angshuman-parashar. But given that 'P' is also an indexing term, there has to be a multicasting opportunity because of it too.

Moreover, by examining the multicast_factor in the else condition I mentioned in my original post, I observed that multicast_factor values of 2, 4, 6, 8, 10, 12, and 14 are reported for the Input operand in this mapping (due to the loop for (auto& x: stats_.ingresses.at(pv).stats) enclosing the else condition). Each multicast factor has identical ingresses. However, I observed that in other mappings, different multicast_factor values have different numbers of ingresses.

What do different multicast_factors represent for an operand in a given mapping?
Also, why is the energy not added up (+=) over all multicast_factors, rather than using only the energy for the largest multicast_factor?

Ignore K and C for a moment. You have 1 parent sending data to R*P children. For each tensor coordinate read from the parent, how many children need that data? Input[0] is used by 1 child, Input[1] by 2, and so on until Input[R-1] is used by R. Now multiply each of those multicast factors by K=2 and you get your 2, 4, 6, ... 14 pattern.
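The counting argument above can be checked by brute force. The following is a minimal sketch, not Timeloop code: the function name `input_multicast_histogram` is hypothetical, and it models only the P+R dimension of the Input index, treating each (k, c, r, p) combination as one child.

```python
from collections import Counter
from itertools import product

def input_multicast_histogram(K, C, R, P):
    """Map each multicast factor to the number of distinct Input
    coordinates that are delivered with that factor."""
    consumers = Counter()
    # Each spatial instance (k, c, r, p) is one child. It reads Input[c][p + r].
    # K does not index Input, so children that differ only in k share the same data.
    for k, c, r, p in product(range(K), range(C), range(R), range(P)):
        consumers[(c, p + r)] += 1
    # Histogram: multicast factor -> number of Input coordinates with that factor.
    return dict(sorted(Counter(consumers.values()).items()))

hist = input_multicast_histogram(K=2, C=2, R=7, P=8)
print(hist)  # {2: 4, 4: 4, 6: 4, 8: 4, 10: 4, 12: 4, 14: 4}
```

For this mapping, every factor from 2 to 14 covers the same number of coordinates, and the largest factor is R * K = 14.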

In general, heterogeneous multicast factors can also be caused by different patterns at different temporal iterations. E.g., the multicast pattern at iteration 0 of some temporal loop may be different from the pattern at iteration 1, either because of data that was held stationary at the child or because of peer-to-peer transfers. And because the mapping can have multiple temporal loops, there can be a combinatoric explosion in the distinct spatial data movement patterns.

Whether induced by temporal or spatial loops, the total energy is the summation over the number of times each pattern is exercised.
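To illustrate the difference between summing over patterns and keeping only one of them, here is a small sketch. Everything in it is an assumption made for illustration: the bucket contents, the ingress counts, and the linear cost function `energy_per_ingress` are invented and do not reflect Timeloop's actual energy model.

```python
def energy_per_ingress(multicast_factor):
    # Hypothetical cost model (arbitrary units): a fixed cost plus a
    # per-destination cost. Not taken from Timeloop.
    return 1.0 + 0.5 * multicast_factor

# Hypothetical buckets: multicast factor -> number of ingresses with that factor.
buckets = {f: 100 for f in range(2, 15, 2)}

summed = 0.0
overwritten = 0.0
for factor in sorted(buckets):
    bucket_energy = buckets[factor] * energy_per_ingress(factor)
    summed += bucket_energy      # accumulates every pattern
    overwritten = bucket_energy  # keeps only the last (largest) factor

print(summed, overwritten)  # 3500.0 800.0
```

With these made-up numbers, overwriting reports less than a quarter of the summed energy, so the choice between the two matters whenever more than one bucket is present.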

Thank you for the explanation, @angshuman-parashar. That's very helpful.

"Whether induced by temporal or spatial loops, the total energy is the summation over the number of times each pattern is exercised."

So am I correct in understanding that the = in the simple_multicast network should indeed be +=?

It's been a while but I believe simple_multicast was intended to model a hardware device that does not have much runtime configurability. So it really can't support mappings that have varying multicast requirements. If my recollection is correct, there really should be a check in there that throws an error if the multicast signature has more than 1 bucket.
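The kind of check described above might look like the following sketch. It is written in Python for brevity and is not the Timeloop implementation: `check_single_bucket` and the dictionary representation of the multicast signature (factor mapped to ingress count) are hypothetical.

```python
def check_single_bucket(buckets):
    """Raise if the multicast signature has more than one bucket.

    `buckets` maps multicast factor -> ingress count (assumed representation).
    Returns the single multicast factor when the signature is uniform.
    """
    factors = [f for f, ingresses in buckets.items() if ingresses > 0]
    if len(factors) > 1:
        raise ValueError(
            f"non-configurable multicast network cannot serve "
            f"multiple multicast factors: {sorted(factors)}"
        )
    return factors[0] if factors else None

print(check_single_bucket({14: 100}))  # 14

try:
    check_single_bucket({2: 100, 14: 100})
except ValueError as err:
    print("rejected:", err)
```

A check like this would reject the mapping discussed in this thread, since it produces seven distinct multicast factors for the Input operand.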

Alternatively, the code can be repurposed to model a more sophisticated configurable device, in which case the energy equation should use += as you observed.

Thanks @angshuman-parashar. Good to know that this is an easy fix, but it will cost a fair amount of compute time for redoing several measurements :(

Before committing to those experiments, I recommend running some small test cases to make sure the numbers line up with your expectations.