ZigZag-Project/zigzag-v1

Understanding input data reuse

suyashbakshi opened this issue · 2 comments

Hello, I'm trying to understand how the reuse for input operand is computed at each memory level. But, I can't understand the point behind the "if-else" condition at:
https://github.com/ZigZag-Project/zigzag/blob/2a6d04d1bdce9731fcc7ec8df6f7acbb8a24bb69/classes/temporal_loop.py#L245

Specifically, the "else" condition is confusing to me. Shouldn't the reuse be computed just as the "if" condition does for each memory level? Why multiply the MAC ops with loop range of innermost_relevant_loop of the memory level which is one level above the level we are computing input reuse for?

Let me ask using a sample mapping I've been using to understand those conditions:

==L2==				
	(4) OY:2
		(3) OX:14	
==L1==
			(4) OY:7
				(5) C:2
					(6) K:13
						(4) OY:7
==L0==
							(3) OX:7
								(1) FX:3
==MAC==
									(5) Cu:14
										(6) Ku:5
											(2) FYu:3

For L0:

  • Reuse is computed using "if" condition since the "innermost_relevant_loop[level+1]" is [(4, 7)] and no FY loop is in L0 temporal loops.

  • the total ops are Cu * Ku * FYu * FX_L0 * OX_L0 = 4410

  • num_of_input_elem computed in the "if" condition = (OY + FYu - 1) * (OX_L0 + FX_L0 - 1) * Cu = (1 + 3 - 1) * (7 + 3 - 1) * 14 = 378

  • and so reuse = 4410/378 = 11.6667

For L1:

Thanks

  • Suyash Bakshi

Hi Suyash,

Theoretically speaking, whether 1) we have the IF-ELSE condition or 2) we don't have and always directly execute statement under the current IF statement, both 1)2) cases can be correct depending on the real HW implementation. We have assumed the optimal HW implementation thus we have used the IF-ELSE condition. Let me first explain to you what the IF-ELSE condition does, and then explain the HW implication related to them.

What the IF-ELSE condition does is checking the above architectural level's innermost loop to see whether it is a pr loop and whether its pr loop pair have shown up in the below levels, if both of these conditions are met, the ELSE branch will be taken and this innermost pr loop of the above level will be merged down to the below level to increase the data reuse of the below level.

An example briefly discussed in our paper is shown below (I didn't use your example because I think an easier example can help you understand quicker): the horizontal line between FX3 and the OXu4 is the level divider, i.e., you can see the OXu4 is a spatial loop unrolled at Input Reg level while the FX3 is a temporal loop scheduled at the Input Local Buffer level. Let's only focus on these two loops for simplicity.

image

Firstly, to make things clear, we define the data reuse factor for a certain operand at a certain level as: for each data element of that operand, how many times it can be reused at that level after it is being sent to that level and before it is overwritten by the next batch of data. For Output and Weight, it equals the product of all their ir loop dimensions of that operand at that level, which is quite straightforward.

Now let's back to the example, if we want to compute Input's Data Reuse factor at I-Reg level, how should we do it? Following the IF statement, the result will be 1, since at I-Reg level there is only OXu4 loop, the total data size for Input is 4 and the total MAC Op is also 4; Following the ELSE statement, we will first merge the above level FX3 down into the I-Reg level and then compute the data reuse factor, the result will be 12 MAC Op / (3+4-1) Data Elem = 2

Which result do you think makes more sense? Actually, both make sense under certain situations, depending on the HW design. The first case means that each cycle the MAC array receives 4 data, it doesn't care that in the next cycle, among the 4 new data it will receive, there are 3 data it has already received in the previous cycle and can be reused. In the second case, the inter MAC unit communication is enabled, and each date will only be fetched from the I-LB level once and being reused at I-Reg level 2 times on average. The below figure illustrates what I have explained.

image

You may wonder why don't we directly write the FX3 loop into the I-Reg level in the first place when generating this mapping representation to avoid this transferring troublesome. The reason is that if we indeed write the FX3 loop down, it means the I-Reg level can hold 6 data instead of 4, which is not true.

We are currently developing a new version of ZigZag, in which we have made this function user-definable (depending on the real HW architecture capability). For now, we take the best case that a hardware design can achieve.

Very good question, Suyash!

Best regards,
Linyan

Linyan, thank you for that explanation. I understand now.