NVlabs/timeloop

Output tiles

nivi1501 opened this issue · 4 comments

Hi,
I wish to keep track of all the output tiles which are being written to the main memory. Basically, I want to:

  1. Assign a tile number to each output tile.
  2. Increment the tile number whenever the output tile is updated (suppose I have 3 input tiles, each of which updates the output tile).
  3. The tile number of that output tile will then be 3.

Is it possible to perform such an analysis using TL? If so, how?

What if the same tile is written multiple times (i.e., read-modified-updated)? Do you want to count those as multiple tile writes, or do you only want to count the number of distinct tiles? In either case you just want the total number (and are not trying to generate a trace of labelled tile writes), correct?

I am trying to generate a trace of labeled tile writes and reads (from DRAM to Global buffer). Basically, I am trying to study the DRAM to Global buffer traffic at the tile level. I wish to generate traces like the following:

(TileID) (Number of elements in a tile) (Type of access (R/W))
T1 512 R
T2 1024 W
....
....
T2 1024 W
...
...

I am already familiar with how to estimate tile sizes and the total number of tiles using Timeloop. I just wish to know whether I can generate a similar trace file using TL and, if so, which source file I should focus on to generate it. Any help in this matter will be highly appreciated. Looking forward to your reply.

Try the tracing feature. It will emit a trace of the axis-aligned hyper-rectangles that the nest analysis visits at each coordinate in space-time.

You will also have to disable temporal (and maybe spatial) extrapolation. Note that this will massively slow down simulation speed. This is because with extrapolation disabled Timeloop starts behaving more like a cycle-level simulator than a fast analytical model. You should also probably only use this with timeloop-model on a specific mapping. Using tracing with the mapper will just generate a ton of noise that's hard to deal with.

To enable all this, set the following env variables:

TIMELOOP_ENABLE_TRACING=1
TIMELOOP_DISABLE_TEMPORAL_EXTRAPOLATION=1
TIMELOOP_DISABLE_SPATIAL_EXTRAPOLATION=1

and then run timeloop-model as you normally do.
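If you prefer to drive this from a script, here is a minimal Python sketch that sets the three variables above for a single run. The helper names (`tracing_env`, `run_traced`), the log file name, and the idea of capturing both stdout and stderr are my assumptions; the arguments are simply whatever you already pass to timeloop-model.

```python
import os
import shutil
import subprocess

# The three variables named above; everything else in this sketch is illustrative.
TRACE_ENV = {
    "TIMELOOP_ENABLE_TRACING": "1",
    "TIMELOOP_DISABLE_TEMPORAL_EXTRAPOLATION": "1",
    "TIMELOOP_DISABLE_SPATIAL_EXTRAPOLATION": "1",
}


def tracing_env(base=None):
    """Return a copy of the environment with the tracing variables set."""
    env = dict(os.environ if base is None else base)
    env.update(TRACE_ENV)
    return env


def run_traced(args, log_path="trace.log"):
    """Run timeloop-model with tracing enabled and save its output.

    `args` are the arguments you normally give timeloop-model.
    Assumption: capturing stdout and stderr together is enough to
    collect the trace; adjust if your build writes it elsewhere.
    """
    exe = shutil.which("timeloop-model")
    if exe is None:
        raise FileNotFoundError("timeloop-model not found on PATH")
    with open(log_path, "w") as log:
        return subprocess.run(
            [exe, *args],
            env=tracing_env(),
            stdout=log,
            stderr=subprocess.STDOUT,
            check=True,
        )

# Hypothetical usage (file names are placeholders):
# run_traced(["arch.yaml", "problem.yaml", "mapping.yaml"])
```

Setting the variables only in the child process keeps the slowdown from disabled extrapolation confined to the runs where you actually want a trace.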

The trace output will look something like this:

    t/7/ s/0/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,0,14:1,2,8,28), } Outputs: { [0,0,14,0:1,256,28,8), } 
      t/8/0/ s/0/0/ Weights: { [0,0,0,0:2,16,1,1), } Inputs: { [0,0,8,14:1,2,16,15), } Outputs: { [0,0,14,8:1,16,15,16), } 
      t/8/1/ s/0/0/ Weights: { [0,128,0,0:2,144,1,1), } Inputs: { [0,0,8,14:1,2,16,15), } Outputs: { [0,128,14,8:1,144,15,16), } 
      t/8/2/ s/0/0/ Weights: { [0,0,0,0:2,16,1,1), } Inputs: { [0,0,8,16:1,2,16,17), } Outputs: { [0,0,16,8:1,16,17,16), } 
      t/8/3/ s/0/0/ Weights: { [0,128,0,0:2,144,1,1), } Inputs: { [0,0,8,16:1,2,16,17), } Outputs: { [0,128,16,8:1,144,17,16), } 
    t/8/ s/0/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,8,14:1,2,16,28), } Outputs: { [0,0,14,8:1,256,28,16), } 
  t/ s/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,0,0:1,2,56,56), } Outputs: { [0,0,0,0:1,256,56,56), } 

Here's how to read the trace:

  • t/.../.../... is a time-stamp.
  • s/.../.../... is a space-stamp.
  • The indentation level and the number of coordinates in the space/time stamp tell you which hardware tiling level you're looking at. E.g., the rank-0 stamps t/ and s/ refer to the outermost (e.g., DRAM) level, because the tile never changes there over space or time; it's the complete tensor. The rank-1 stamps t/8/ and s/0/ refer to the next-inner level (probably the GlobalBuffer), and in this case tell you the tile resident at GlobalBuffer space-coordinate 0 at time-step 8. As you go deeper into the hierarchy, the rank of the time and space stamps increases.
  • Weights: { [0,0,0,0:2,16,1,1), } says that at this space/time coordinate the mapping installs a Weights tile that is represented by an axis-aligned hyper-rectangle between the points [0,0,0,0] (inclusive) and [2,16,1,1] (exclusive).
  • Note that these tiles that are being printed out are what we call the "T-relation", i.e., they are the tiles that are present in the hardware space-time coordinate. They are not the "Delta-relation", i.e., they do not represent the incremental data that is moved in to construct the tile. Based on your original ask, I believe you may be more interested in the Delta trace. It should be relatively straightforward to extend this existing tracing code in nest-analysis.cpp to optionally emit the Delta trace as well. This will be a valuable contribution to the tool.
  • Also note that this tracing is at the abstract nest analysis level, and so does not understand bypassing. So even if your mapping does not store, e.g., Outputs, at the GlobalBuffer, the trace will still show that tensor there. Bypassing is modeled as a post-processing step in tiling.cpp. By the time Timeloop gets to that stage of processing, all fine-grained information about space and time has been discarded, so it's not possible to generate a trace there. So you may have to do some outboard post-processing if you want to incorporate bypassing into the trace.
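For outboard post-processing, the reading rules above can be sketched as a small Python parser. This is only an illustration based on the sample lines in this thread, not part of Timeloop: the function names are mine, and `new_elements` is a rough stand-in for the Delta-relation that assumes the previous tile at the same space coordinate is still resident and that each tensor is a single hyper-rectangle.

```python
import re
from math import prod

# Patterns inferred from the sample trace lines in this thread (an assumption).
STAMP_RE = re.compile(r"t/((?:\d+/)*)\s+s/((?:\d+/)*)")
TENSOR_RE = re.compile(r"(\w+):\s*\{([^}]*)\}")
BOX_RE = re.compile(r"\[([\d,\s-]+):([\d,\s-]+)\)")


def _stamp(text):
    """'8/0/' -> (8, 0); '' -> ()."""
    return tuple(int(part) for part in text.split("/") if part)


def _point(text):
    return tuple(int(coord) for coord in text.split(","))


def parse_trace_line(line):
    """Split one trace line into stamps and per-tensor hyper-rectangles."""
    m = STAMP_RE.search(line)
    if m is None:
        return None
    tiles = {}
    for name, body in TENSOR_RE.findall(line[m.end():]):
        # Each box is (inclusive low point, exclusive high point).
        tiles[name] = [(_point(lo), _point(hi)) for lo, hi in BOX_RE.findall(body)]
    time, space = _stamp(m.group(1)), _stamp(m.group(2))
    # The stamp rank gives the tiling level: 0 is the outermost level.
    return {"time": time, "space": space, "level": len(time), "tiles": tiles}


def box_volume(box):
    """Number of elements in an axis-aligned hyper-rectangle."""
    lo, hi = box
    return prod(max(0, h - l) for l, h in zip(lo, hi))


def new_elements(prev_box, curr_box):
    """Elements of curr_box that are not covered by prev_box."""
    (plo, phi), (clo, chi) = prev_box, curr_box
    overlap = (tuple(map(max, plo, clo)), tuple(map(min, phi, chi)))
    return box_volume(curr_box) - box_volume(overlap)


line = "t/8/ s/0/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,8,14:1,2,16,28), } Outputs: { [0,0,14,8:1,256,28,16), }"
rec = parse_trace_line(line)
print(rec["level"], box_volume(rec["tiles"]["Weights"][0]))  # 1 512
```

Filtering on `rec["level"] == 1` keeps only the rank-1 lines, which per the explanation above are probably the GlobalBuffer tiles in this example.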

For more background on hierarchical space/time stamps you can refer to this paper: https://research.nvidia.com/publication/2021-01_hardware-abstractions-targeting-eddo-architectures-polyhedral-model

Thanks a lot for sharing this valuable information. This precise explanation helped me a lot. I tried generating the 'delta' trace and got the following results.

      t/0/191/ s/0/10/ Weights: { [26,31,2:27,32,3), } Inputs: { [26,17:27,18), } Outputs: { }
      t/0/191/ s/0/11/ Weights: { [27,31,2:28,32,3), } Inputs: { [27,17:28,18), } Outputs: { }
      t/0/191/ s/0/12/ Weights: { [28,31,2:29,32,3), } Inputs: { [28,17:29,18), } Outputs: { }
      t/0/191/ s/0/13/ Weights: { [29,31,2:30,32,3), } Inputs: { [29,17:30,18), } Outputs: { }
      t/0/191/ s/0/14/ Weights: { [30,31,2:31,32,3), } Inputs: { [30,17:31,18), } Outputs: { }
      t/0/191/ s/0/15/ Weights: { [31,31,2:32,32,3), } Inputs: { [31,17:32,18), } Outputs: { }
    t/0/ s/0/ Weights: { [0,0,0:32,32,3), } Inputs: { [0,0:32,18), } Outputs: { [0,0:32,16), }
  t/ s/ Weights: { [0,0,0:32,32,3), } Inputs: { [0,0:32,18), } Outputs: { [0,0:32,16), }

Now I only need to focus on the DRAM to global buffer tile movement (the rest is just noise to me). What I deduce is that at t/1/ s/0/ an additional 11616 weight elements and 2497 input elements are read from DRAM, since, as you mentioned, the Delta trace represents the incremental data that is moved in to construct the tile. The output, however, remains stationary in the global buffer. Please let me know if my inferences are correct.

t/0/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/1/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/2/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/3/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/4/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/5/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/6/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/7/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/8/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/9/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/10/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/11/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/12/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/13/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/14/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/15/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/16/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/17/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/18/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/19/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/20/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/21/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/22/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/23/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/24/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/25/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/26/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/27/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/28/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/29/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/30/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/31/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/32/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/33/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/34/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/35/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/36/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
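
To sanity-check numbers like these, a short Python sketch can parse the per-time-step summary and total the element counts per tensor. The summary format is the one I wrote by hand above, not something Timeloop emits, and the helper names are made up for illustration; the sample text is regenerated from the pattern in my listing (Outputs is nonzero on every third step).

```python
import re

# Matches one hand-written summary record such as
# "t/3/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280".
SUMMARY_RE = re.compile(
    r"t/(\d+)/\s*s/(\d+)/\s*Weights\s*=\s*(\d+),\s*"
    r"Inputs\s*=\s*(\d+),\s*Outputs\s*=\s*(\d+)"
)


def parse_summary(text):
    """Return one dict per (time, space) record found in the text."""
    return [
        {"t": int(t), "s": int(s), "Weights": int(w), "Inputs": int(i), "Outputs": int(o)}
        for t, s, w, i, o in SUMMARY_RE.findall(text)
    ]


def total_elements(records):
    """Sum the per-step element counts for each tensor."""
    return {
        name: sum(rec[name] for rec in records)
        for name in ("Weights", "Inputs", "Outputs")
    }


# Rebuild the 37 records listed above (t/0/ through t/36/).
text = "\n".join(
    f"t/{t}/ s/0/ Weights = 11616, Inputs = 2497, "
    f"Outputs = {5280 if t % 3 == 0 else 0}"
    for t in range(37)
)
records = parse_summary(text)
print(len(records), total_elements(records))
# 37 {'Weights': 429792, 'Inputs': 92389, 'Outputs': 68640}
```

Whether the Outputs entries count as reads or writes is not something this summary encodes, so the sketch only totals them.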