CGRA4ML: A Framework to Implement Modern Neural Networks for Scientific Edge Computing

cgra4ml is a Python library that helps researchers build, train, and implement their own deep ML models, such as ResNet CNNs, Autoencoders, and Transformers on FPGAs and custom ASIC.

It takes a lot of effort and expertise to implement highly optimized neural networks on edge platforms. The challenging aspects include:

Designing an optimal dataflow architecture
Building & verifying an accelerator, optimizing for high-frequency
Building the System-on-Chip, verifying and optimizing data bottlenecks
Writing C firmware to control the accelerator and verify its correctness

Often, after all that work, the models do not meet their expected performance due to memory bottlenecks and sub-optimal hardware implementation.

We present a highly flexible, high-performance accelerator system that can be adjusted to your needs through a simple Python API. The framework is maintained as open source, allowing a user to modify the processing element to their desired data type using customized architecture, easily expand the architecture to meet the desired performance, and implement new neural network models.

Execution API

#define NDEBUG
#include "platform.h"
#include "deepsocflow_xilinx.h"

int main() {

  hardware_setup();
  xil_printf("Welcome to DeepSoCFlow!\n Store weights, biases & inputs at: %p; \n", &mem.w);

  model_setup();
  model_run();    // run model and measure time

  // Print: outputs & measured time
  Xil_DCacheFlushRange((INTPTR)&mem.y, sizeof(mem.y));  // force transfer to DDR, starting addr & length
  for (int i=0; i<O_WORDS; i++)
    printf("y[%d]: %f \n", i, (float)mem.y[i]);
  printf("Done inference! time taken: %.5f ms \n", 1000.0*(float)(time_end-time_start)/COUNTS_PER_SECOND);

  hardware_cleanup();
  return 0;
}

Motivation

HLS4ML is an open source python framework that's being widely adopted by the scientific community, to generate FPGA & ASIC implementations of their custom Deep Neural Networks. CERN has taped out chips with DNN compression algorithms to be used in LHC using HLS4ML. However, it is not possible to implement deeper neural networks on HLS4ML since it implements one engine per layer in hardware. This project aims to solve that problem and enhance HLS4ML, by creating a statically & dynamically reconfigurable, AXI-Stream DNN engine.

Quick Start

You need either Verilator 5.014+ or XIlinx Vivado for simulation
Clone this repo and install deepsocflow

git clone https://github.com/KastnerRG/cgra4ml
cd cgra4ml
pip install .

Run the example

# Edit SIM and SIM_PATH in the file to match your simulator
cd run/work
python ../example.py

FPGA implementation:

3.1. Generate Bitstream from Vivado:

# Make sure correct fpga board was specified in the above script. Default is ZCU102
# Open Xilinx Vivado, cd into deepsocflow, and type the following in TCL console
cd run/work
source vivado_flow.tcl

3.2. Run on a ZYNQ FPGA:

Open Xilinx Vitis
Create an application project, using .xsa generated by running the run/work/vivado_flow.tcl
Right click on application project -> Properties
- ARM v8 gcc compiler -> Directories -> Add Include Paths: Add absolute paths of run/work and deepsocflow/c
- ARM v8 gcc compiler -> Optimization -> Optimization most (-O3)
- ARM v8 gcc linker -> Libraries -> Add Library: m (math library)
Build, Connect board & launch debug
Add a breakpoint at model_setup(). When breakpoint hits, load run/work/vectors/wbx.bin to the address printed.
Continue - This will run the model and print outputs & execution time

ASIC implementation with Cadence Genus & Innovus:

# First add your PDK to 'asic/pdk', change paths in the scripts and run:
cd run/work
genus -f ../../tcl/asic/run_genus.tcl
innovus
source ../../tcl/asic/pnr.tcl

Framework Infrastructure

Team Members

Aba
Zhenghua