NVlabs/timeloop

Arithmetic unit latency and throughput

MustafaFayez opened this issue · 8 comments

After looking at the code, I now think Timeloop assumes that every compute operation/action in an arithmetic unit takes one cycle. Is that true?
If so, is there a way to change that without modifying the source code?

There's a way to achieve this in post-processing. The overall cycle count that Timeloop reports is effectively the max of the cycle counts reported by each level (including arithmetic and storage). The code isn't written that way, but it effectively amounts to this behavior. To simulate a different arithmetic rate, you can take the reported arithmetic cycles and scale them by whatever factor you want, then compare that scaled cycle count to the unscaled cycle counts for the storage levels; the largest of these determines the overall performance.

This should work on the model. On the mapper it will work if performance is not an optimization metric. If it is, then it may lead the mapper astray since the mapper's feedback loop will be working off of the internal cycle count and you cannot inject your hacked cycle count into the feedback loop.
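The post-processing described above can be sketched roughly as follows. This is only an illustration: the function name, the level names, and the cycle counts are hypothetical, and the per-level numbers would have to be read by hand from the model's reported statistics.

```python
def scaled_overall_cycles(arith_cycles, storage_cycles, cycles_per_op):
    """Estimate overall cycles when each arithmetic op occupies the unit
    for `cycles_per_op` cycles instead of the assumed single cycle.

    arith_cycles   -- arithmetic cycle count reported by the model
    storage_cycles -- dict mapping storage level name to its reported cycles
    cycles_per_op  -- scaling factor (rep-rate of the arithmetic unit)
    """
    scaled_arith = arith_cycles * cycles_per_op
    # The overall count is effectively the max across all levels.
    return max([scaled_arith, *storage_cycles.values()])


# Hypothetical per-level cycle counts, for illustration only.
storage = {"RegFile": 1000, "GlobalBuffer": 2500, "DRAM": 3200}

print(scaled_overall_cycles(1000, storage, cycles_per_op=1))  # storage-bound
print(scaled_overall_cycles(1000, storage, cycles_per_op=4))  # arithmetic-bound
```

With these made-up numbers, a factor of 1 leaves the design bound by DRAM, while a factor of 4 makes the arithmetic level the bottleneck. As noted above, this only corrects the reported result after the fact; it does not change what the mapper optimizes for.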

Thanks for the prompt response! I just wanted to confirm with an existing example (the PIM example, since the MAC there takes more than one cycle). So, how do you account for the MAC taking more than one cycle in the PIM example? My guess is that it happens through the A2D component. Could you confirm and explain how it happens?

Paging @nellie-wu since I am not a PIM expert.

BTW I should mention that if your arithmetic is pipelined with a rep-rate of 1 cycle then you shouldn't need any hacks -- the existing code should give you approximately the right answer (it will be off by a few cycles thanks to the increased pipeline fill latency, but Timeloop ignores that today anyway). It's only if your rep-rate is > 1 cycle that you need to use the scaling I mentioned.

I get your point, yes. Actually, the case I was mentioning (similar to the provided PIM example) has < 1 MAC/cycle throughput, so I think I need the hack you mentioned.
I was also curious to see if going with the way the PIM example has been written would be more accurate.

The PIM example is a simple setup that also assumes 1 MAC/cycle. However, the logical MAC unit is composed of many memory cells, each with a much smaller resolution. If by < 1 MAC/cycle you were referring to the bit-serial processing of the input activations, that aspect can be represented by adding another dimension to the problem setup for the input activation. In that case, we will be able to model the fact that we need n cycles for input activations with n bits.

Could you elaborate more on adding another dimension to the problem setup? As in how is that added in the PIM example?

The PIM example currently does not have bit-serial modeling. To do that, we can add a new dimension to the problem specification, similar to the spec below for an 8b input activation case (note the extra B dimension in both the problem instance and the shape). Then we can have a temporal loop for B in the constraints/mapping to model the fact that each bit takes a cycle. Of course, the compound component definitions in the architecture need to be properly updated so that their energy is characterized under the bit-serial assumptions as well.

Our team is working on more complex PIM specifications, including bit-serial processing, and will have the specs available when ready.

problem:
  instance:
    C: 3
    M: 64
    N: 1
    P: 224
    Q: 224
    R: 3
    S: 3
    B: 8
  shape:
    coefficients:
    - default: 1
      name: Wstride
    - default: 1
      name: Hstride
    - default: 1
      name: Wdilation
    - default: 1
      name: Hdilation
    data-spaces:
    - name: Weights
      projection:
      - - - C
      - - - M
      - - - R
      - - - S
    - name: Inputs
      projection:
      - - - B
      - - - N
      - - - C
      - - - R
          - Wdilation
        - - P
          - Wstride
      - - - S
          - Hdilation
        - - Q
          - Hstride
    - name: Outputs
      projection:
      - - - N
      - - - M
      - - - Q
      - - - P
      read-write: true
    dimensions:
    - C
    - M
    - R
    - S
    - N
    - B
    - P
    - Q
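
As a rough sanity check on what the extra B dimension does, the total number of compute operations is the product of the instance bounds, so adding B: 8 multiplies it by 8. The sketch below just does that arithmetic; it assumes B is mapped entirely as a temporal loop with all other loops unchanged, in which case the arithmetic cycle count would scale by the same factor. The helper name is hypothetical and this is not part of Timeloop.

```python
from math import prod

# Instance bounds copied from the problem spec above.
instance = {"C": 3, "M": 64, "N": 1, "P": 224, "Q": 224, "R": 3, "S": 3, "B": 8}


def total_operations(dims):
    """Total compute operations = product of all dimension bounds."""
    return prod(dims.values())


with_b = total_operations(instance)
without_b = total_operations({k: v for k, v in instance.items() if k != "B"})

print("operations without B:", without_b)
print("operations with B:   ", with_b)
print("scaling factor:      ", with_b // without_b)
```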

Thanks @nellie-wu! Can't wait to try this new feature out