NVlabs/timeloop

# of reductions not as expected

hqjenny opened this issue · 0 comments

The total number of reductions, summed across all levels, currently does not equal the theoretical minimum: (# of elementwise ops) - (# of output elements).
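For reference, the expected total can be computed directly from the problem dimensions. The sketch below assumes the loop bounds C=32, K=32, R=3, P=16, read off the mappings in this issue (they are consistent with Weights:3072 and Outputs:512), and assumes one elementwise op per (C, K, R, P) point. The variable names are illustrative and are not Timeloop identifiers.

```python
# Assumed problem dimensions, inferred from the loop nests in this issue.
C, K, R, P = 32, 32, 3, 16

elementwise_ops = C * K * R * P   # one multiply per (c, k, r, p) point
output_elements = K * P           # matches Outputs:512 in the mappings

# Each output needs (number of contributions - 1) additions,
# so summed over all outputs the minimum is ops - outputs.
min_reductions = elementwise_ops - output_elements

print(elementwise_ops, output_elements, min_reductions)  # 49152 512 48640
```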

Here are two examples:

  1. Temporal reduction enabled at both Register and GlobalBuffer level
MainMemory [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
---------------------------------------------------------------------
| for P in [0:1)

GlobalBuffer [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
-----------------------------------------------------------------------
|   for C in [0:32)
|     for K in [0:2)
|       for R in [0:3)
|         for K in [0:16) (Spatial-X)

RegisterFile [ Weights:1 (1) Inputs:16 (16) Outputs:16 (16) ]
-------------------------------------------------------------
|           for P in [0:16)

This mapping is for Timeloop tutorial exercise 4. The # of temporal reductions at the RegisterFile level (for each instance) is 3040, calculated as 3072 (content_accesses) + 0 (peer_accesses) - 32 (partition_size) = 3040. The content accesses are (P=16) * (R=3) * (K=2) * (C=32) = 3072. The total # of reductions at the RegisterFile level is 3040 (reductions per instance) * 16 (instances) = 48640. This would be correct if there were no reduction capability at the GlobalBuffer level, but the # of temporal reductions reported at the GlobalBuffer level is 15872, which means reduction is enabled at the GlobalBuffer level.
Therefore, the number of reductions at the RegisterFile level is incorrect.
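The arithmetic for this first mapping can be checked with a short script. This is only a sketch of the accounting as described in this issue, not Timeloop's implementation. The variable names mirror the terms used above, and the expected minimum assumes one elementwise op per (C, K, R, P) point with C=32, K=32, R=3, P=16.

```python
# Per-instance RegisterFile numbers quoted in this issue.
content_accesses = 16 * 3 * 2 * 32   # P * R * K (temporal) * C
peer_accesses = 0
partition_size = 32
rf_instances = 16                    # K in [0:16) (Spatial-X)

rf_reductions_per_instance = content_accesses + peer_accesses - partition_size
rf_reductions_total = rf_reductions_per_instance * rf_instances

gb_reductions = 15872                # reported at the GlobalBuffer level

# Expected total: elementwise ops minus output elements (assumed dimensions).
min_reductions = 32 * 32 * 3 * 16 - 32 * 16

reported_total = rf_reductions_total + gb_reductions
print(rf_reductions_per_instance, rf_reductions_total)  # 3040 48640
print(reported_total, reported_total - min_reductions)  # 64512 15872
```

Under these assumptions the RegisterFile total alone already equals the expected minimum, so the GlobalBuffer reductions are counted on top of it.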

  2. Mapping that introduces spatial reductions
MainMemory [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
---------------------------------------------------------------------
| for P in [0:1)

GlobalBuffer [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
-----------------------------------------------------------------------
|   for C in [0:2)
|     for K in [0:32)
|       for R in [0:3)
|         for C in [0:16) (Spatial-X)

RegisterFile [ Weights:1 (1) Inputs:16 (16) Outputs:16 (16) ]
-------------------------------------------------------------
|           for P in [0:16)

The reported # of spatial reductions for this mapping is 15360. The # of temporal reductions is 2560 at the RegisterFile level and 512 at the GlobalBuffer level. Summing these over the 16 RegisterFile instances gives 2560 * 16 + 512 + 15360 = 56832, which is more reductions than the minimum.
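The gap for this second mapping can be quantified with the same kind of check. As before, this is a sketch: the reported counts are copied from this issue, and the minimum assumes one elementwise op per (C, K, R, P) point with C=32, K=32, R=3, P=16.

```python
# Counts reported for the spatial-reduction mapping in this issue.
rf_temporal_per_instance = 2560
rf_instances = 16                 # C in [0:16) (Spatial-X)
gb_temporal = 512
spatial = 15360

reported_total = rf_temporal_per_instance * rf_instances + gb_temporal + spatial

# Expected total: elementwise ops minus output elements (assumed dimensions).
min_reductions = 32 * 32 * 3 * 16 - 32 * 16

excess = reported_total - min_reductions
print(reported_total, min_reductions, excess)  # 56832 48640 8192
```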