NVlabs/timeloop

3D PE array calculations

joon9885 opened this issue · 2 comments

Hello. First of all, thank you for the contributions. I appreciate it.

I am trying to get energy calculations for 3-dimensional PE arrays in which each PE is connected to its adjacent PEs. Here, 3D means the PEs are connected in the x, y, and z directions in Cartesian coordinates.

I am now reading the code to make Timeloop understand the z-direction connections, but because of the code's complexity I am struggling to find where to start modifying it.

Do you have any suggestions on where to start? I want to validate that 3D structures do produce shorter paths and therefore higher energy efficiency.

Also, since z-direction connections (such as TSVs and MIVs) differ from 2D connections in wire energy and delay, how can I take those factors into account?

Thank you.

Yes, that would be an interesting study. There are 3 ways to implement this, and unfortunately all of them will need some code hacking.

Approach 1

The way that X and Y affect energy today can be seen in terms of (a) the number of hops it takes for a parent->{children} data transfer to complete, and (b) the per-hop energy cost.

The (a) hop-count computation is in NestAnalysis::ComputeAccurateMulticastedAccesses(). I recommend looking at the distrib-mcast-expts branch, where I made some recent updates. The logic there asks: if I start from a parent and have to make a multicast transfer to children { (x1, y1), (x2, y2), ..., (xn, yn) }, how many hops through the interconnection network will that multicast cost? For your purposes you'll have to augment that with an additional z coordinate and figure out an updated algorithm.
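To make the "additional z coordinate" idea concrete, here is a minimal standalone sketch of one possible 3D hop count. It is not the algorithm in NestAnalysis: it assumes a 3D mesh with dimension-ordered (X, then Y, then Z) routing and counts each link once when several children share it. The names `route_links` and `multicast_hops` are hypothetical. It already records hops per axis, which becomes useful if the per-hop cost differs between axes.

```python
from typing import Dict, Iterable, List, Set, Tuple

Coord = Tuple[int, int, int]
Link = Tuple[int, Coord, Coord]  # (axis index, from-node, to-node)


def route_links(src: Coord, dst: Coord) -> List[Link]:
    """Links on the X-then-Y-then-Z dimension-ordered route from src to dst."""
    links: List[Link] = []
    cur = list(src)
    for axis in range(3):
        step = 1 if dst[axis] > cur[axis] else -1
        while cur[axis] != dst[axis]:
            prev = tuple(cur)
            cur[axis] += step
            links.append((axis, prev, tuple(cur)))
    return links


def multicast_hops(src: Coord, children: Iterable[Coord]) -> Dict[str, int]:
    """Per-axis hop count of a multicast, counting each shared link only once."""
    used: Set[Link] = set()
    for child in children:
        used.update(route_links(src, child))
    hops = {"x": 0, "y": 0, "z": 0}
    for axis, _, _ in used:
        hops["xyz"[axis]] += 1
    return hops


if __name__ == "__main__":
    # Three children that share most of their route from the parent at the origin.
    print(multicast_hops((0, 0, 0), [(1, 0, 0), (1, 1, 0), (1, 1, 1)]))
```

Under these assumptions the three children above cost 3 hops in total (one per axis), whereas three independent unicasts would cost 1 + 2 + 3 = 6. A different routing policy or topology would need a different `route_links`.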

For (b), Timeloop uses a very simple internal floorplanner that places everything hierarchically into a square grid, determines the area of each such hardware tile, determines the linear distance to traverse each tile, and multiplies that with a fJ/bit/mm wire energy cost from the underlying energy model (Accelergy) to derive the cost-per-hop. However, you can simply override the energy-per-hop number in the architecture YAML, which will bypass all of this floorplanning/wire energy derivation. This will work if you can come up with a single uniform pJ/hop to place in the YAML. If, however, your cost differs depending on whether you are traversing along X, Y, or Z (as it would with TSVs or MIVs), then you'll have to model things more carefully. Your updated hop calculation algorithm will probably need to record hops along each axis, and then the final network model (you'll have to extend src/model/network-legacy.cpp) will have to apply a different cost for each axis. It's more involved but not crazy.
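The arithmetic described here can be sketched as follows. This is an illustration, not Timeloop's code: it assumes a square tile (so the traversal distance is the square root of the tile area) and a fixed number of bits per transfer, and the function names and all numbers are hypothetical.

```python
import math
from typing import Dict


def hop_energy_pj(tile_area_mm2: float, wire_fj_per_bit_mm: float, word_bits: int) -> float:
    """Energy of one hop across a tile, in pJ, derived from wire energy."""
    hop_length_mm = math.sqrt(tile_area_mm2)  # assumption: square tile
    return hop_length_mm * wire_fj_per_bit_mm * word_bits / 1000.0  # fJ -> pJ


def transfer_energy_pj(hops: Dict[str, int], pj_per_hop: Dict[str, float]) -> float:
    """Total energy when each axis has its own per-hop cost."""
    return sum(count * pj_per_hop[axis] for axis, count in hops.items())


if __name__ == "__main__":
    # Made-up numbers: in-plane hops derived from the floorplan, Z hops overridden.
    planar = hop_energy_pj(tile_area_mm2=4.0, wire_fj_per_bit_mm=100.0, word_bits=16)
    costs = {"x": planar, "y": planar, "z": 0.5}  # 0.5 pJ/hop for Z is a placeholder
    print(transfer_energy_pj({"x": 2, "y": 1, "z": 1}, costs))
```

With a single uniform pJ/hop the three entries in `costs` collapse to one value, which is the case the YAML override already covers.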

You'll also have to change things up on the mapping side. Right now each spatial level in a mapping can have an X and Y component. The way it works is that each spatial permutation has a split point, which splits the permutation string (e.g., CKRSPQN) into two sets of dimensions, with one set mapped onto the hardware X axis and the other onto the hardware Y axis. You'll have to extend this to 2 split points so that the dimensions are separated into three sets, one each for X, Y, and Z.
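As an illustration of the two-split-point idea, the sketch below cuts a permutation string into three sets. It does not reflect how Timeloop parses mappings: the function name, the argument names, and the choice that the leading segment maps to X are all assumptions made for the example.

```python
from typing import Dict, List


def split_permutation(perm: str, split_xy: int, split_yz: int) -> Dict[str, List[str]]:
    """Split a spatial permutation into X, Y and Z dimension sets."""
    if not 0 <= split_xy <= split_yz <= len(perm):
        raise ValueError("split points must satisfy 0 <= split_xy <= split_yz <= len(perm)")
    return {
        "X": list(perm[:split_xy]),
        "Y": list(perm[split_xy:split_yz]),
        "Z": list(perm[split_yz:]),
    }


if __name__ == "__main__":
    print(split_permutation("CKRSPQN", 2, 4))
```

Setting `split_yz` to the length of the string leaves the Z set empty, which reduces to the single split point used for X and Y.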

Approach 2

I think a MUCH easier way to do this could be a slight hack/workaround -- in your architecture YAML, whenever you have the 3 XYZ dimensions, create an additional Dummy level with only a spatial fanout (see our Eyeriss examples and Timeloop tutorial videos to see how Dummy levels work). This is going to be your "Z" dimension, but you can let Timeloop think that it's the X dimension. I think (but I'm not 100% convinced) that if you set the energy-per-hop costs appropriately for this network, then it may emulate the data movement behavior you're trying to model. But you'll have to stare at the hop-calculation code in NestAnalysis, think about what it's going to do for each level of the network, and convince yourself that it's modeling the right thing. Please feel free to ask any specific questions about the codebase as you investigate this.

Approach 3

This is the ideal approach. I think the amount of coding here is actually going to be less than in Approach 1, but it involves some refactoring of the codebase. Let me explain.

The fact that X and Y are baked into the mapping today is awful and abstraction-breaking. Mappings should treat a spatial level as a linear fanout to N children without worrying about how those N children are arranged into an XY physical grid (or XYZ, or XYZW, or any arbitrary higher order). This means that the NestAnalysis code (which works with mappings at a much higher level of abstraction) should not be performing any hop calculations; it should simply pass along a signature of which children were touched (in the linear space) by any specific data transfer block. The actual hop calculation would be performed in the network code, which maps that linear space into an actual physical topology. This would dramatically clean up the codebase. However, the data structure that is used to pass data from NestAnalysis into the model (AccessStats) would become more complicated than it is today (it just captures the average number of hops). This is something we've wanted to do for a long time but just didn't have the bandwidth for. If you're interested we'll be happy to guide you and make sure you receive some kudos :-).
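A rough sketch of that separation of concerns: the analysis side hands over only a set of linear child IDs, and the network side decides what they mean physically. Everything here is hypothetical (the function names, the first-axis-fastest unflattening, the parent sitting at the origin), and the hop count is a plain sum of per-child Manhattan distances standing in for whatever multicast-aware calculation the real network model would use.

```python
from typing import Iterable, Tuple


def linear_to_coord(index: int, dims: Tuple[int, ...]) -> Tuple[int, ...]:
    """Unflatten a linear child ID into a coordinate, first axis varying fastest."""
    coord = []
    for size in dims:
        coord.append(index % size)
        index //= size
    return tuple(coord)


def hops_for_signature(touched: Iterable[int], dims: Tuple[int, ...]) -> Tuple[int, ...]:
    """Per-axis hops from a parent at the origin to every touched child.

    Placeholder policy: independent unicasts, so shared links are counted repeatedly.
    """
    totals = [0] * len(dims)
    for child in touched:
        for axis, value in enumerate(linear_to_coord(child, dims)):
            totals[axis] += value
    return tuple(totals)


if __name__ == "__main__":
    signature = {0, 5, 7}  # linear child IDs, as the analysis side would report them
    print(hops_for_signature(signature, (2, 2, 2)))  # same children on a 2x2x2 grid
    print(hops_for_signature(signature, (8,)))       # same children on a 1D line
```

The same signature is evaluated against two topologies without the caller knowing either, which is the point of moving the hop calculation into the network code.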

Pick your poison. :-)

Thank you for the detailed explanation.

I will go through your answer and think about how to proceed.