NVlabs/timeloop

Problem in getting optimal dataflow

AsicDyc opened this issue · 10 comments

Hi,

Thank you for developing such an amazing tool and providing a clear tutorial for it. I'm currently designing a dataflow for a conv2d workload and have encountered some challenges in finding the optimal dataflow.

I found that when no constraints are provided in the arch.yaml file, the Timeloop mapper only returns a suboptimal solution, or even an incorrect dataflow with PE utilization = 0.3 and pJ/compute = 24.242, whereas when I give explicit constraints, the mapper returns a dataflow with PE utilization up to 0.96 and pJ/compute = 9.475.

I really wonder why an arch.yaml with no constraints cannot find a better dataflow. My initial thought was that an unconstrained arch.yaml would generate a larger mapspace, one that potentially encompasses the mapspace created by a constrained arch.yaml. However, the results suggest otherwise.

Could you provide insight into why an unconstrained arch.yaml fails to identify more optimal dataflows? Any guidance or suggestions would be greatly appreciated.

Here is the constrained arch.yaml file:

architecture:
  version: 0.4
  nodes: # Top-level is hierarchical
  - !Container # Top-level system
    name: system
    attributes:
      technology: "32nm"
      global_cycle_seconds: 1e-9

  - !Component # DRAM main memory
    name: DRAM
    class: DRAM
    attributes:
      type: "LPDDR4"
      width: 64
      datawidth: 8
      read_bandwidth: 8589935000
      write_bandwidth: 8589935000

  - !Container
    name: FPGA

  - !Component
    name: GlobalBuffer
    class: SRAM
    attributes:
      depth: 204800
      width: 16
      n_banks: 3
      datawidth: 8
      read_bandwidth: 8388608
      write_bandwidth: 8388608
    constraints:
      temporal:
        permutation: [Q]
        factors: [Q>=16]

  - !Container
    name: PE_column
    spatial: { meshX: 16 }
    constraints:
      spatial:
        permutation: [M]
        factors: [M=16]

  - !Container
    name: PE
    spatial: { meshY: 5 }
    constraints:
      spatial:
        permutation: [P]
        factors: [P=5]

  - !Component
    name: Register
    class: regfile
    attributes:
      depth: 128
      width: 8
      datawidth: 8
    constraints:
      temporal:
        permutation: [R, S, C]
        factors: [R=4, S=4, C=3]

  - !Component # MAC unit
    name: MAC
    class: intmac
    attributes:
      multiplier_width: 8
      adder_width: 8

And here is the prob.yaml file:

problem:
  version: 0.4
  instance:
    C: 3
    Hdilation: 1
    Hstride: 2
    M: 32
    N: 1
    P: 112
    Q: 112
    R: 4
    S: 4
    Wdilation: 1
    Wstride: 2
  shape:
    coefficients:
    - default: 1
      name: Wstride
    - default: 1
      name: Hstride
    - default: 1
      name: Wdilation
    - default: 1
      name: Hdilation
    data_spaces:
    - name: Weights
      projection:
      - - - C
      - - - M
      - - - R
      - - - S
    - name: Inputs
      projection:
      - - - N
      - - - C
      - - - R
          - Wdilation
        - - P
          - Wstride
      - - - S
          - Hdilation
        - - Q
          - Hstride
    - name: Outputs
      projection:
      - - - N
      - - - M
      - - - Q
      - - - P
      read_write: true
    dimensions:
    - C
    - M
    - R
    - S
    - N
    - P
    - Q
    name: CNN_Layer

The problem with a completely unconstrained search is that the space is so vast that you cannot afford to run an exhaustive search. Even if you are comparing heuristic searches on constrained-vs-unconstrained mapspaces, the probability of arriving at a near-optimal solution (for that space) is much higher with the smaller constrained space.

It's a tough problem. Better heuristics help. Georgia Tech's GAMMA mapper was ported to work with Timeloop, but it is not actively supported.

In our experience constraining the mapspace seems to be the best strategy.

That said, for small-ish architectures you should be able to run an exhaustive search (set algorithm to linear-pruned and both timeout and victory-condition to 0). If you run it to completion, it should find the strictly optimal mapping. If this is not happening, please let us know ASAP because it's clearly a bug. I was also concerned about your assertion that the unconstrained search was giving an "incorrect dataflow". Could you elaborate?
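For reference, the mapper settings for such an exhaustive run would look roughly like the sketch below. This is only a sketch using the v0.4 keys that appear elsewhere in this thread; check the key spellings against your Timeloop build.

mapper:
  version: 0.4
  optimization_metrics: [ edp ]
  algorithm: linear-pruned
  timeout: 0             # 0 => ignore this termination condition
  victory_condition: 0   # 0 => ignore this termination condition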

Thanks for your reply.

I apologize for causing you concern; my phrasing was not accurate. When I mentioned an "incorrect dataflow," I actually meant a dataflow that is too far from the optimal solution and offers no practical reference value for the design, not a genuinely incorrect dataflow.

I have tried to run an exhaustive search by setting algorithm to linear-pruned and both timeout and victory-condition to 0, but got the message "no valid mappings found within search criteria". Should I set timeout and victory-condition to -1 instead of 0?

I then tried setting timeout and victory-condition to -1 and started the search, but the terminal only reported a few dataflows with PE utilization varying from 0.1 to 0.3 and then appeared stuck for hours. I pressed Ctrl+C and got the same result as the heuristic searches.

The arch.yaml and prob.yaml files I am running are the ones I provided above. Could this be because my architecture is too complex for the exhaustive algorithm to make progress?

0 is the magic number that causes the search algorithms to ignore those termination conditions. That leaves exhausting the mapspace as the only criterion for the mapper threads to terminate. See here:

if (victory_condition_ > 0 && mappings_since_last_best_update >= victory_condition_)

Since these are unsigned variables, setting them to -1 probably caused them to wrap around to UINT_MAX, which would effectively have the same result. So I am surprised (and concerned) that you saw different behavior with -1 vs. 0. Could you please confirm that this is true?

If the exhaustive search is triggering successfully, it's probably not stuck but has just stopped reporting updates because it isn't seeing any better mappings. You can turn on live_status: True; hopefully that will show you that the mapper is still running but just not finding anything good.

It's certainly possible that the mapspace happens to be laid out in such a pathologically poor way that the exhaustive search only gets to the good mappings much later. One suggestion is to add a min-parallelism constraint to the innermost level in your arch (Register) like so (I'm using the v0.3 grammar):

target      : Register
type        : parallelism
min         : 0.9

This will not reduce the mapspace but will early-reject mappings that don't have enough spatial fanout, so the model doesn't waste time evaluating them.
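For reference, in a standalone v0.3-style constraints file that would be one entry in the constraints list, roughly as sketched below. The exact wrapper keys (mapspace_constraints, targets) may differ between Timeloop versions and front ends, so treat this as a sketch rather than a drop-in file.

mapspace_constraints:
  targets:
    - target: Register
      type: parallelism   # min-parallelism constraint on the innermost level
      min: 0.9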

I have retried the search with timeout and victory-condition set to 0 and live_status and diagnostics set to True, but still got the message "no valid mappings found within search criteria".

This is what the live-status returned:

================================================================================
                                TIMELOOP MAPPER
================================================================================
TID      Total    Invalid      Valid    Consec.       Last   Opt.util Opt.energy
                                        invalid     update
--------------------------------------------------------------------------------
  0          1          1          0          1          0
  1          1          1          0          1          0
  2          1          1          0          1          0
  3          1          1          0          1          0
  4          1          1          0          1          0
  5          1          1          0          1          0
  6          1          1          0          1          0
  7          1          1          0          1          0

It seems that there was only one mapping in every thread, and each one of them was invalid.

The diagnostics output reports a single fail class: the Fanout fail class.

Here is my mapper.yaml file:

mapper:
  version: 0.4
  optimization_metrics: [ edp ]
  live_status: True
  num_threads: 8
  timeout: 0
  victory_condition: 0
  algorithm: linear_pruned
  max_permutations_per_if_visit: 16
  diagnostics: True

Could you please check whether my settings match what you intended? Are any of them unreasonable and leading to the outcome above?

I retried the search with timeout and victory-condition set to -1 and live_status set to True. From the live-status output, it is evident that the exhaustive search is indeed running.

However, by repeating the searches, I have confirmed that setting timeout and victory-condition to 0 versus -1 does indeed result in different behavior.

Understood. Thank you for helping us by re-verifying the behavior. Let me look into it.

But in the meantime, I recommend proceeding with constrained searches.

Thank you for your patience and guidance. I'm looking forward to your further reply.