NVlabs/timeloop

Maximizing PE array utilization in convolution runs


Hello there,

I am currently designing a 200x200 PE convolution accelerator. I started from the base template provided in the exercises and have read through some of the documentation, but my mapping runs only achieve about 1-2% utilization.

Attached are my input architecture file, the parsed/processed input, the generated map, and the statistics showing the utilization.

My innermost PE spatial loop bounds seem to unroll only along the Y axis, with nothing along the X axis. I believe the issue comes from my constraints definition, but I also have the intuition that the problem dimensions (VGG) are not well suited to such a large PE array, which is why I try mapping multiple batches at once.
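
For reference, the PE-level spatial constraint I have been experimenting with looks roughly like the sketch below; the factors, permutation, and split point are illustrative rather than a copy of my attached file:

```yaml
# Rough sketch of my PE-level spatial constraint (illustrative, not my exact file).
# My understanding is that 'split' divides the permutation between the two
# hardware axes: dimensions before the split point map to X, the rest to Y.
mapspace_constraints:
  - target: PE
    type: spatial
    factors: K=200 C=200   # guessed factors; VGG dims may not divide 200 evenly
    permutation: KC
    split: 1               # K on the X axis, C on the Y axis
```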

Any input is appreciated.

arch_conv.txt
parsed-processed-input-large-pe-array-multi-batch.txt

timeloop-mapper.stats.txt
timeloop-mapper.map.txt

There's something odd. Your spec appears to be creating a 200x200 array but the stats.txt reports 16x16 instances at all inner levels of the hierarchy. Are you sure the stat dump is from this arch?
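
The first thing I'd check is the PE level in the arch YAML; for a 200x200 array you'd expect the instance range and meshX to look something like the sketch below (names and sizes are illustrative, in the style of the exercise templates):

```yaml
# Expected shape of the PE level for a 200x200 array (illustrative names/sizes).
# 40000 instances in total; meshX: 200 tells Timeloop the array is 200 wide,
# so the Y dimension works out to 40000 / 200 = 200.
- name: PE[0..39999]
  local:
    - name: RegisterFile
      class: regfile
      attributes:
        meshX: 200
        depth: 64    # placeholder sizing
        width: 16    # placeholder sizing
```

If the stats keep reporting 16x16 instances, the mapper is almost certainly reading a different (or stale) arch file.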

Overall, a 200x200 array is hard to fill spatially. Most mappings will be underutilized, so I suspect the mapper search is simply giving up too quickly. Try tweaking the search hyperparameters to make it try harder. Also, in your innermost buffer constraints you should add a minimum-parallelism constraint (e.g., 0.5). This will early-reject any mapping that doesn't achieve at least 50% spatial utilization. It won't stop the search heuristic from visiting such mappings, but it will skip the expensive evaluation cost for them.
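
Concretely, something along these lines; exact keys can vary a bit between Timeloop releases, so treat this as a sketch rather than a drop-in config:

```yaml
# Make the search more persistent before declaring defeat.
mapper:
  algorithm: random-pruned
  optimization-metrics: [ energy, delay ]
  num-threads: 8
  timeout: 30000           # consecutive invalid mappings tolerated per thread
  victory-condition: 2000  # consecutive non-improving valid mappings before a thread stops
  search-size: 0           # 0 = no fixed budget; run until victory-condition kicks in

# Early-reject mappings that fill less than half of the PE array.
mapspace_constraints:
  - target: PE             # your innermost buffer level
    type: utilization
    min: 0.5
```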