- Tera is a unit prefix in the metric system denoting multiplication by one trillion, or 10^12.
- Tera is derived from the Greek word τέρας teras, meaning "monster".
Both meanings motivated the name of this GEMM accelerator generator, "TERAS". Indeed, it can generate arbitrarily large accelerators, hence the monstrous aspect, while delivering >10^12 (tera) floating-point operations per second.
TERAS is a command-line generator that produces kernels computing the BLAS Level-3 routine GEMM, a.k.a. Matrix-Matrix Multiplication (MMM). It focuses on arithmetic aspects in order to reduce power consumption while offering more precise and reproducible results.
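For reference, a minimal software sketch of the GEMM routine that the generated kernels accelerate (pure Python, illustration only, not how the hardware computes it):

```python
# Reference (naive) GEMM: C = alpha * (A @ B) + beta * C.
# A and B are lists of lists (row-major); no external dependencies.
def gemm(alpha, A, B, beta, C):
    n, k = len(A), len(A[0])
    m = len(B[0])
    out = [[beta * C[i][j] for j in range(m)] for i in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):           # inner (dot-product) dimension
                acc += A[i][p] * B[p][j]
            out[i][j] += alpha * acc
    return out
```

For example, `gemm(1.0, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 0.0, [[0, 0], [0, 0]])` yields `[[19.0, 22.0], [43.0, 50.0]]`.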
The original design targets FPGAs, even though it is target-agnostic. One of the design choices was a heavy use of flip-flops, as they are present in good quantity in modern FPGAs (~2x more FFs than LUTs). That is why there is a paragraph on FPGA evaluation and one on the mpw5 adaptation.
The kernels rely on 2D (NxM) meshes implemented as Systolic Arrays. Every signal is transferred by means of local and distributed connections to improve scalability (see Fig. below).
At each clock cycle, N and M real numbers are taken from the rows of input matrix A and the columns of input matrix B, respectively. These numbers arrive in a dense form, the computer format in which they are stored in memory, before being translated into "S3" by "A2S3" modules. S3 is a custom format that can represent any incoming computer format while being optimized for the internal hardware cells / blocks.
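To convey the idea behind A2S3 in software terms, here is an illustrative analogy that unpacks a binary float into a (sign, scale, significand) triple. This is only an assumption-laden sketch: the actual S3 hardware encoding is specific to the generator and differs from this.

```python
import math

# Illustrative analogy of an A2S3-style conversion: unpack a float
# into (sign, scale, significand) with value = (-1)**sign * 2**scale * significand
# and the significand normalized into [1, 2).
# NOTE: the real S3 format is hardware-specific; this only conveys the idea.
def to_s3(x):
    if x == 0.0:
        return (0, 0, 0.0)              # canonical zero
    sign = 0 if x > 0 else 1
    frac, exp = math.frexp(abs(x))      # abs(x) = frac * 2**exp, frac in [0.5, 1)
    return (sign, exp - 1, frac * 2.0)  # renormalize significand into [1, 2)

def from_s3(sign, scale, significand):
    return (-1) ** sign * (2.0 ** scale) * significand
```

The point of such a unified internal triple is that inputs arriving in different memory formats can all be handled by the same downstream arithmetic cells.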
The data in S3 format, representing the coefficients of the matrices, enter the Processing Elements (PEs). Each PE performs a fused dot product (FDP), i.e., a dot product without any intermediate rounding. Additionally, each PE passes the data along to its south and east neighbours. Example in the image below:
The hardware responsible for the fused dot product is S3FDP, depicted below:
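Functionally, the fused dot product computed by S3FDP can be emulated in software with exact accumulation and a single final rounding. This is a behavioural sketch (using exact rationals), not the hardware algorithm, which uses a wide fixed-point accumulator:

```python
from fractions import Fraction

# Behavioural model of a fused dot product (FDP): every product is
# accumulated exactly (no intermediate rounding); the result is rounded
# to the output format only once, at the very end.
def fused_dot_product(xs, ys):
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)  # exact product and exact sum
    return float(acc)                     # single final rounding

# A naive float dot product, by contrast, rounds after every operation:
def naive_dot_product(xs, ys):
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc
```

On ill-conditioned inputs the difference is visible: with `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, the naive version returns `0.0` (the `1.0` is absorbed by cancellation), while the fused version returns the correct `1.0`. This is the "more precise and reproducible results" aspect mentioned above.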
Other peculiarities of this generator comprise:
- automated pipelining (FloPoCo)
FloPoCo is an open-source framework for generating FLOating-POint COres, but not only. The FloPoCo philosophy is to generate just what is needed for the computation, without mimicking the floating-point units of general-purpose processors. FloPoCo takes as input a behavioral hardware description and a (frequency, target) pair, then outputs the necessary and sufficient synthesizable VHDL.
In this work, I have written many FloPoCo operators, including the whole Systolic Array, which allows me to generate any configuration I wish in less than a second. Example below:
```
./flopoco SystolicArray N=3 M=3 arithmetic_in=posit:8:0 arithmetic_out=same msb_summand=12 lsb_summand=-12 nb_bits_ovf=7 has_HSSD=true chunk_size=-1 frequency=400 target=Kintex7 name=SystolicArray outputFile=SystolicArray.vhdl
*** Final report ***
Output file: SystolicArray.vhdl
Target: Kintex7 @ 400 MHz
| |---Entity LZOCShifter_6_to_6_counting_8_F400_uid18
| |      Pipeline depth = 1
|---Entity Arith_to_S3
|      Pipeline depth = 2
| |---Entity LZOCShifterSticky_32_to_7_counting_64_F400_uid22
| |      Pipeline depth = 3
| |---Entity RightShifterSticky8_by_max_8_F400_uid24
| |      Pipeline depth = 2
|---Entity l2a
|      Pipeline depth = 7
| | | | |---Entity DSPBlock_6x6_F400_uid35
| | | | |      Not pipelined
| | | |---Entity IntMultiplier_F400_uid31
| | | |      Not pipelined
| | | |---Entity LeftShifter12_by_max_31_F400_uid38
| | | |      Pipeline depth = 1
| | |---Entity s3fdp
| | |      Pipeline depth = 2
| |---Entity PE_S3
| |      Not pipelined
|---Entity SystolicArrayKernel
|      Not pipelined
Entity SystolicArray
   Not pipelined
```
- HSSD (Half Speed Sink Down)
It is a mechanism I developed to allow the output-stationary systolic array to output intermediate and final results while still receiving input data.
This matters because my final setup for evaluating the arrays uses CAPI2 and OpenCAPI links (capi_wiki), which provide duplex throughputs approaching 20 GB/s.
Without HSSD, and therefore without pipelined input and output operations, these 20 GB/s would be significantly lowered.
However, as the command-line snippet above shows, it is possible to generate the array without HSSD, which then creates global routes and big multiplexers.
HSSD comes at the cost of N * M * 2 * size_accumulator flip-flops.
This is not a problem for modern FPGAs, as they contain ~2x more FFs than LUTs. It is a problem for ASICs, however, as FFs are expensive (~30 transistors each in sky130).
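A quick back-of-the-envelope calculation of this cost, assuming the formula expands to N * M * 2 * size_accumulator and using the ~30 transistors/FF figure from the text (the 32-bit accumulator width below is a hypothetical example value, not taken from the source):

```python
# HSSD flip-flop cost, assuming N * M * 2 * size_accumulator FFs
# (interpretation of the formula in the text, for illustration).
def hssd_ff_cost(N, M, size_accumulator):
    return N * M * 2 * size_accumulator

# Rough sky130 transistor estimate at ~30 transistors per FF (from the text).
def sky130_transistors(ff_count, transistors_per_ff=30):
    return ff_count * transistors_per_ff
```

For a hypothetical 3x3 array with a 32-bit accumulator, this gives 576 FFs, i.e. roughly 17,280 transistors in sky130; negligible on an FPGA, but noticeable on a small ASIC shuttle.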
The following results should exhibit similar trends for an ASIC, albeit several orders of magnitude lower in throughput and performance.
The first adaptation concerns the number of I/Os. Indeed, TERAS is not designed to minimize the number of I/Os, as it originally targets high-throughput and low-latency links.
For mpw5, I have connected the inputs rowsA and colsB to the same LSBs of the Wishbone data bus; the operation performed is therefore C = A.A_T (where A_T is the transpose of A).
The output colsC and the corresponding valid signals go to the external pins of the chip.
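In software terms, the operation the mpw5 wiring ends up computing can be sketched as follows (a functional sketch only, not the Wishbone protocol):

```python
# Because rowsA and colsB receive the same data on mpw5, the array
# computes C = A . A_T, i.e. A multiplied by its own transpose.
def a_times_a_transpose(A):
    n = len(A)       # number of rows of A; C is n x n
    k = len(A[0])    # inner dimension
    return [[sum(A[i][p] * A[j][p] for p in range(k)) for j in range(n)]
            for i in range(n)]
```

For example, `a_times_a_transpose([[1, 2], [3, 4]])` yields `[[5, 11], [11, 25]]`, a symmetric (Gram-style) matrix, as expected from C = A.A_T.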
A good step towards improving FloPoCo and the generation of datapaths for ASICs would be to write a sky130 target as part of FloPoCo and issue a pull request to their repo.
This step is too time-consuming for me at the moment. For instance, I have only just learnt that two gates in an ASIC can sometimes be faster than one. One day, maybe...
This project is licensed under the Apache 2.0 License.
Louis LEDOUX (Binaryman)
To cite this work, please refer to the article published at FCCM 2022, whose BibTeX entry is shown below:
```bibtex
@INPROCEEDINGS{ledoux2022,
  author={Ledoux, Louis and Casas, Marc},
  booktitle={2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
  title={A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs},
  year={2022},
  doi={10.1109/FCCM53951.2022.9786164}
}
```