harvard-acc/gem5-aladdin

Systolic array sizes

mateo-vm opened this issue · 7 comments

I am testing the systolic array accelerator, and I am having problems when changing the array sizes. For the default configuration (8x8), everything works fine.

For smaller sizes, the simulation never finishes: it gets stuck in a loop. The following is an excerpt of that loop for a 4x4 systolic array. This happens even with the base test.c.

963152000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: global: Weight fold barrier, arrived: 1.
963153000: global: Weight fold barrier, arrived: 2.
963153000: global: evaluate
963153000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: global: Weight fold barrier, arrived: 3.
963154000: global: Weight fold barrier, arrived: 4.
963154000: global: evaluate
963154000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: global: Weight fold barrier, arrived: 5.
963155000: global: Weight fold barrier, arrived: 6.
963155000: global: evaluate
963155000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963156000: global: Weight fold barrier, arrived: 7.
963156000: global: Weight fold barrier, arrived: 8.
963156000: global: All have arrived at the weight fold barrier.

As for larger SAs, there are two cases. With the base test.c, SAs smaller than 16x16 work as expected; from 16x16 upwards I get a segmentation fault.

When I try to simulate larger layers (e.g., ResNet's conv3) on larger SAs, I get the following error instead:

fatal: Streaming out premature data!

How can these errors be fixed?

Yuan did you get a chance to look into this? IIRC we've tested out both 4x4 and 16x16 arrays in the past.

Sam, I haven't taken a look yet. It could be a bug introduced by more recent changes. I will look into it this week.

Hi! Would there be any update regarding this issue?

Sorry for the late response. After a bit of investigation, I think the hang with the 4x4 PE configuration is caused by a bug in the commit unit (which collects data from the PEs and writes the results to the local SRAM): it currently assumes the number of PE columns is larger than the writeback line size. I will upload a fix for this tomorrow.
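To make the failure mode concrete, here is a minimal sketch of the kind of bug being described; all names (CommitUnit, collect, writeBackLine, etc.) are hypothetical and this is not the actual gem5-aladdin source. If the commit unit only writes a line to the local SRAM after collecting a full writeback line's worth of PE-column outputs, an array with fewer columns than the line size can be left with a partially filled line that is never flushed, so nothing is committed and the fetch units keep spinning at the weight fold barrier, which matches the trace above.

// Hypothetical sketch of the described commit-unit bug (not the actual
// gem5-aladdin code). The unit gathers one output per PE column into a
// writeback line and only commits the line once it is full.
#include <cstddef>
#include <vector>

struct CommitUnit {
  size_t numPeCols;            // PE columns feeding this unit (4 in the failing case)
  size_t lineSize;             // elements per SRAM writeback line (e.g., 8)
  std::vector<float> lineBuf;  // partially assembled writeback line

  CommitUnit(size_t cols, size_t line) : numPeCols(cols), lineSize(line) {}

  // Buggy behavior: only write back once a full line has been collected.
  // With numPeCols < lineSize and no more outputs coming, the line never
  // fills, nothing is committed, and the upstream barrier never releases.
  bool collect(const std::vector<float>& colOutputs) {
    lineBuf.insert(lineBuf.end(), colOutputs.begin(), colOutputs.end());
    if (lineBuf.size() < lineSize)
      return false;
    writeBackLine();
    return true;
  }

  // Sketch of a fix: also flush a partial line once the current output tile
  // has been fully drained from the PE array.
  bool collectFixed(const std::vector<float>& colOutputs, bool tileDone) {
    lineBuf.insert(lineBuf.end(), colOutputs.begin(), colOutputs.end());
    if (lineBuf.size() >= lineSize || (tileDone && !lineBuf.empty())) {
      writeBackLine();
      return true;
    }
    return false;
  }

  void writeBackLine() {
    // ... issue the SRAM write for lineBuf here ...
    lineBuf.clear();
  }
};

Under that assumption, the fix amounts to flushing a partial line once the current output tile has been drained (the collectFixed variant above) rather than waiting for a full line that can never arrive.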

@mateo-vm: did this resolve your problems?

Hey Sam, the MRs didn't fix the bug for the 16x16 configuration. Sorry, I've been quite busy lately, but I will get to this soon.

@mateo-vm: did this resolve your problems?

Yes, thank you very much. In the end I focused on arrays up to 8x8, so #45 really helped. On my side it is solved, but I won't close the issue in case you still want to fix the 16x16 problem.