Accelergy-Project/timeloop-accelergy-exercises

The total energy is not related to the number of arithmetic instances.

Closed this issue · 13 comments

Hello,

In the timeloop+accelergy exercise, I changed the number of PEs from 168 to 42. Looking at the timeloop-mapper.stats.txt file, I noticed that the Energy (total) of the arithmetic part (level 0) does not change between the 168 PE and 42 PE versions, unlike the cycle count and the area.

- architecture with 168 PE -> Energy (total)       : 2034656870.40 pJ
- architecture with 42 PE  -> Energy (total)       : 2034656870.40 pJ

Can you tell me why this energy value does not change?

Thank you in advance.

Hi,
When I run the examples, I do see the energy consumption results change when I change the number of PEs. Could you tell me in which exercise this happened? Are you sure the constraints are consistent with the number of PEs?

Hi,

I see this with exercises/timeloop+accelergy/ and the command:

timeloop-mapper arch/eyeriss_like-int16.yaml \
                arch/components/*.yaml \
                prob/prob.yaml \
                mapper/mapper.yaml \
                constraints/*.yaml

The only change I made is on line 40:

          - name: PE[0..167] 
          v
          - name: PE[0..41]

I chose 42 PEs because that allows Timeloop to determine Ymesh = 4.

Are you sure the constraints conditions meet the number of PEs?

Which constraints are you referring to?

Hi,
I will check the exercise and share my results.

Hi,
I ran the timeloop+accelergy exercise and simulated both the 168 PE and 42 PE cases. As we can see in the output, the total energy consumption is different:
In the 168 PE case -> in timeloop-mapper.stats.txt the total energy is 30.99 pJ/MACC (line number 783)
In the 42 PE case -> in timeloop-mapper.stats.txt the total energy is 17.49 pJ/MACC (line number 783)
So the results show that the number of PEs does affect energy consumption. Note that I used a hybrid algorithm ( Line 7 ) and VGG16_Conv_1.yaml as the input model to reduce the simulation time and get results faster.

Hope this helps.

Thank you very much for your time.

I am sorry, I realize that my question was not clear enough.

On my side too, I get a different pJ/MACC value for the 168 PE and 42 PE cases. My question was specifically about Energy (total) on line 17 of the timeloop-mapper.stats.txt file.

Even in the example you ran, at line 17 of timeloop-mapper.stats.txt, I see that the value of Energy (total) does not change between the 168 PE and 42 PE cases. From my point of view, this energy value should be lower if there are fewer PEs, like the area on line 18, which is lower when there are 42 PEs.

Do you know what this energy value means? (average energy maybe?)

In my opinion, Energy (total) is the maximum energy consumed by the MAC component, not an average energy, for two reasons:

  1. If it were the average energy, it would have to change in proportion to the number of Utilized instances (line 15 of timeloop-mapper.stats.txt), because Utilized instances directly affects the MAC energy consumption.
  2. When I changed the MAC datawidth from 16 to 32, the Energy (total) changed from 190749081.60 pJ to 755713179.65 pJ, which suggests that the Energy (total) depends only on the type of MAC component.

Energy (total) is the total amount of energy consumed by all arithmetic units over the duration of the entire workload. If the problem shape does not change, and if properties of the arithmetic unit (bitwidth) do not change, then that total energy will not change. It does not matter if you use 1 PE or 10 million PEs, they will consume exactly the same total amount of arithmetic energy (though one will be slower) to execute a given workload.
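
To make the explanation above concrete, here is a minimal sketch (not Timeloop code). It assumes a hypothetical workload of 1,000,000 MACCs and a hypothetical cost of 2.0 pJ per MACC; those numbers and the function name `arithmetic_energy_pj` are made up for illustration. The work is split across the PEs, and the total is the same however many PEs share it.

```python
def arithmetic_energy_pj(total_maccs: int, num_pes: int, energy_per_macc_pj: float) -> float:
    """Total arithmetic energy when total_maccs are spread over num_pes PEs."""
    # Split the MACCs as evenly as possible across the PEs.
    base, extra = divmod(total_maccs, num_pes)
    maccs_per_pe = [base + (1 if i < extra else 0) for i in range(num_pes)]
    # Every MACC costs the same energy no matter which PE executes it,
    # so only the total count matters.
    return sum(maccs_per_pe) * energy_per_macc_pj


TOTAL_MACCS = 1_000_000   # hypothetical workload size
ENERGY_PER_MACC_PJ = 2.0  # hypothetical per-operation energy

for num_pes in (1, 42, 168):
    energy = arithmetic_energy_pj(TOTAL_MACCS, num_pes, ENERGY_PER_MACC_PJ)
    print(f"{num_pes:>3} PEs -> {energy:.2f} pJ")
```

Changing `ENERGY_PER_MACC_PJ` (for example, for a wider datapath) changes the total; changing `num_pes` does not.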

@angshuman-parashar
Hi,
Thank you for your explanation.

Hi,
Thank you for your answer @angshuman-parashar

If I understand well, this means that Timeloop takes into account only the dynamic energy and not the static energy.

And in my mind, in an architecture with several PEs, when the number of PEs increases different energy overheads can appear e.g. network connectivity overhead, latency ...

Does Timeloop take into account this kind of overhead depending on the number of PEs used in an architecture?

If I understand well, this means that Timeloop takes into account only the dynamic energy and not the static energy.

Correct.

And in my mind, in an architecture with several PEs, when the number of PEs increases different energy overheads can appear e.g. network connectivity overhead, latency ...

Does Timeloop take into account this kind of overhead depending on the number of PEs used in an architecture?

Yes. When you scale the number of PEs, the energy cost to transfer data from, say, a shared global buffer to those PEs will increase. Timeloop has a built-in floorplanner that it uses to lay out those PEs. It then estimates the wire distance (based on PE area) and the number of hops that a piece of data must traverse to reach one PE (or several, in the case of a multicast). These are then applied to the pJ/bit/mm wire energy cost (taken from the underlying energy model, e.g. Accelergy, or overridden by the user) to determine the network energy. This is of course all hierarchically composable, so if you compose together multiple global buffer tiles and service them from an even larger shared buffer, a similar process is applied at the next level.

Increased pipeline fill latency, however, is not counted; performance estimation is purely throughput-based.
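
As a rough illustration of the idea described above, the sketch below estimates network energy from a hop count, a per-hop wire length derived from PE area, and a pJ/bit/mm cost. This is not Timeloop's floorplanner: the square-PE assumption, the corner-placed source, the mesh shapes, and every number and function name here are hypothetical.

```python
import math


def mean_hops(mesh_x: int, mesh_y: int) -> float:
    """Mean Manhattan distance (in hops) from a source at corner (0, 0) to each PE."""
    total = sum(x + y for x in range(mesh_x) for y in range(mesh_y))
    return total / (mesh_x * mesh_y)


def network_energy_pj(num_bits: int, hops: float, pe_area_mm2: float,
                      wire_pj_per_bit_mm: float) -> float:
    """Energy to move num_bits across `hops` hops of wire."""
    # Assumption: PEs are square, so one hop spans one PE side length.
    hop_length_mm = math.sqrt(pe_area_mm2)
    return num_bits * hops * hop_length_mm * wire_pj_per_bit_mm


PE_AREA_MM2 = 0.01        # hypothetical PE area
WIRE_PJ_PER_BIT_MM = 0.1  # hypothetical wire energy cost
WORD_BITS = 16

for mesh_x, mesh_y in ((14, 12), (14, 3)):  # hypothetical 168 PE and 42 PE layouts
    hops = mean_hops(mesh_x, mesh_y)
    energy = network_energy_pj(WORD_BITS, hops, PE_AREA_MM2, WIRE_PJ_PER_BIT_MM)
    print(f"{mesh_x}x{mesh_y} mesh: {hops:.1f} mean hops, {energy:.3f} pJ per word")
```

Under these assumptions the larger mesh has more hops on average, so delivering the same word costs more wire energy, which is the scaling effect described above.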
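
A throughput-based estimate of this kind can be sketched as follows. This is only an illustration under stated assumptions (one MACC per utilized PE per cycle, perfect load balance, no stalls, a hypothetical workload of 1,000,000 MACCs), not Timeloop's actual performance model; `compute_cycles` is a made-up name.

```python
def compute_cycles(total_maccs: int, utilized_pes: int) -> int:
    """Throughput-only cycle estimate: each utilized PE retires one MACC per cycle.

    No pipeline fill or drain latency is added, so the estimate depends only
    on how much work there is and how many PEs share it.
    """
    return -(-total_maccs // utilized_pes)  # integer ceiling division


TOTAL_MACCS = 1_000_000  # hypothetical workload size

for pes in (168, 42):
    print(f"{pes:>3} PEs -> {compute_cycles(TOTAL_MACCS, pes)} cycles")
```

With fewer PEs the cycle count goes up while the arithmetic energy stays the same, which matches the stats reported earlier in the thread (cycles change, arithmetic Energy (total) does not).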

Energy (total) is the total amount of energy consumed by all arithmetic units over the duration of the entire workload. If the problem shape does not change, and if properties of the arithmetic unit (bitwidth) do not change, then that total energy will not change. It does not matter if you use 1 PE or 10 million PEs, they will consume exactly the same total amount of arithmetic energy (though one will be slower) to execute a given workload.

@angshuman-parashar Does this mean that the dataflow does not change the total energy? Assuming that changing the number of PEs changes the design space (the dataflow for optimal data reuse) and hence the number of accesses to the storage levels, the total energy should ideally change. What do you think?

I was just talking about the arithmetic energy in that comment, which does not change with dataflow. But the data movement energy (storage reads, updates, fills and network data movement energy) certainly changes and is accounted for in Timeloop's modeling.
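
The distinction can be sketched like this. The access counts, per-access energies, and level names below are hypothetical and are not taken from any Timeloop run; network energy is left out to keep the example short.

```python
# Hypothetical per-access energy for each storage level, in pJ.
ENERGY_PER_ACCESS_PJ = {"DRAM": 200.0, "GlobalBuffer": 6.0, "RegisterFile": 1.0}
ENERGY_PER_MACC_PJ = 2.0  # hypothetical
TOTAL_MACCS = 1_000_000   # fixed by the problem shape, not by the mapping


def energy_breakdown_pj(accesses: dict) -> tuple:
    """Return (arithmetic, data_movement, total) energy for one mapping."""
    arithmetic = TOTAL_MACCS * ENERGY_PER_MACC_PJ
    movement = sum(count * ENERGY_PER_ACCESS_PJ[level]
                   for level, count in accesses.items())
    return arithmetic, movement, arithmetic + movement


# Two hypothetical mappings (dataflows) of the same workload: the second
# reuses data less, so it touches DRAM and the global buffer more often.
mapping_a = {"DRAM": 10_000, "GlobalBuffer": 200_000, "RegisterFile": 3_000_000}
mapping_b = {"DRAM": 40_000, "GlobalBuffer": 500_000, "RegisterFile": 3_000_000}

for name, mapping in (("A", mapping_a), ("B", mapping_b)):
    arith, move, total = energy_breakdown_pj(mapping)
    print(f"mapping {name}: arithmetic={arith:.0f} pJ, movement={move:.0f} pJ, total={total:.0f} pJ")
```

In this sketch the arithmetic term is identical for both mappings, while the data movement term, and therefore the overall total, differs.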

My bad. Yeah, that makes sense. Thank you!