DLR: Deep Learning Routine

'DLR (Deep Learning Routines)', a part of the DPU (Deep Learning Processing Unit), is a collection of high-level synthesizable C/C++ routines for deep learning inference networks.

All contents are provided as is, WITHOUT ANY WARRANTY, and NO TECHNICAL SUPPORT will be provided for problems that might arise.

1. Overview
1.1 Note
1.2 Prerequisites
1.3 Conventions
1.4 How to build
1.5 How to use
2. Deep-Learning Routines
2.1 Activation
2.1.1 Hyperbolic tangent
2.1.2 Leaky ReLU
2.1.3 ReLU
2.1.4 Sigmoid
2.2 Concatenation
2.2.1 Concat2d
2.3 Convolution
2.3.1 Convolution2d
2.4 Deconvolution
2.4.1 Deconvolution2d
2.5 Linear (Fully connected)
2.6 Normalization
2.6.1 Batch normalization
2.7 Pooling
2.7.1 Average pooling
2.7.2 Max pooling
3. Projects
3.1 Project: LeNet-5 for MNIST
3.2 Project: Tiny YOLO-V2 for VOC
4. Other things
4.1 Acknowledgment
4.2 Authors and contributors
4.3 License
4.4 Revision history

1. Overview

'DLR (Deep Learning Routines)', a part of the DPU (Deep Learning Processing Unit), is a collection of high-level synthesizable C/C++ routines for deep learning inference networks.

As shown in the picture below, DLR routines are verified in the context of C/C++, Python, PyTorch, and FPGA.


Overall framework

DLR contains the following routines (some are under development and more will be added):

Convolution layer (strict convolution)
Activation layer (ReLU, Leaky ReLU, hyperbolic tangent, sigmoid)
Pooling layer (max, average)
Fully connected layer (linear layer)
De-convolution layer (transposed convolution layer)
Batch normalization layer
Concatenation layer

The routines have the following highlights:

Fully synthesizable C code
Highly parameterized so that they can be adapted to a wide range of uses
C, Python, PyTorch, and Verilog test-benches
FPGA verified using Future Design Systems’ CON-FMC

1.1 Note

Note that the routines are not optimized for performance, because they are written for hardware implementation rather than for computation. In addition, the routines support inference only, not training.

Version 1.4 uses 'dlr' namespace.

1.2 Prerequisites

This program requires the following.

GNU GCC: C compiler
PyTorch / Conda / Python
(Optional for hardware verification and implementation)
- HDL simulator: Xilinx Xsim
- High level synthesis: Xilinx Vivado HLS

1.3 Conventions

This program uses the following conventions.

DLR routines are defined within the 'dlr' namespace for C++.
Each DLR routine has C++ and C versions, where the C++ version uses templates.
Python wrappers follow the C++ names.
PyTorch wrappers follow the PyTorch torch.nn.functional names.

1.4 How to build

Simply do as follows; the 'include' and 'lib' directories will then be ready to use under the 'v1.4' directory.

$ cd v1.4/src
$ make clean
$ make
$ make install

1.5 How to use

Specify where the DLR header files reside using the "-I" option of the g++ compiler, where $(DLR_HOME) is the path to the deep-learning routines, such as /home/user/Deep_Learning_Routines/v1.4.

$ g++ .... -I$(DLR_HOME)/include ....

Specify where the DLR library resides using the "-L" and "-l" options of the g++ linker.

$ g++ .... -L$(DLR_HOME)/lib -ldlr ....

Refer to the DLR Projects for more details.

2. Deep-Learning Routines

2.1 Activation

2.1.1 Hyperbolic tangent

2.1.2 Leaky ReLU

2.1.3 ReLU

2.1.4 Sigmoid
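
The four activations named in this section are all element-wise, so they can be sketched with one loop each. The block below is a minimal illustration, not the DLR implementation: the function names (`relu_sketch`, `leaky_relu_sketch`, `tanh_sketch`, `sigmoid_sketch`) and the `negative_slope` parameter are hypothetical, and only the template-on-`TYPE` style is taken from the conventions above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative sketches only; these names are hypothetical, not the DLR API.

// ReLU: out = max(0, in), element-wise.
template<class TYPE=float>
void relu_sketch(TYPE *out_data, const TYPE *in_data, std::size_t size)
{
    assert(out_data != nullptr && in_data != nullptr);
    for (std::size_t i = 0; i < size; i++) {
        out_data[i] = (in_data[i] > TYPE(0)) ? in_data[i] : TYPE(0);
    }
}

// Leaky ReLU: negative inputs are scaled by 'negative_slope' instead of zeroed.
template<class TYPE=float>
void leaky_relu_sketch(TYPE *out_data, const TYPE *in_data, std::size_t size,
                       TYPE negative_slope)
{
    assert(out_data != nullptr && in_data != nullptr);
    for (std::size_t i = 0; i < size; i++) {
        out_data[i] = (in_data[i] > TYPE(0)) ? in_data[i]
                                             : in_data[i] * negative_slope;
    }
}

// Hyperbolic tangent, computed in double and cast back to TYPE.
template<class TYPE=float>
void tanh_sketch(TYPE *out_data, const TYPE *in_data, std::size_t size)
{
    assert(out_data != nullptr && in_data != nullptr);
    for (std::size_t i = 0; i < size; i++) {
        out_data[i] = static_cast<TYPE>(std::tanh(static_cast<double>(in_data[i])));
    }
}

// Sigmoid: out = 1 / (1 + exp(-in)).
template<class TYPE=float>
void sigmoid_sketch(TYPE *out_data, const TYPE *in_data, std::size_t size)
{
    assert(out_data != nullptr && in_data != nullptr);
    for (std::size_t i = 0; i < size; i++) {
        const double x = static_cast<double>(in_data[i]);
        out_data[i] = static_cast<TYPE>(1.0 / (1.0 + std::exp(-x)));
    }
}
```

The `std::tanh` and `std::exp` calls are convenient for a software reference; a synthesizable version would more likely use a table or a piecewise approximation, which is outside what this sketch shows.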

2.2 Concatenation

2.2.1 Concat2d


Concatenation

2.3 Convolution

2.3.1 Convolution2d

'Convolution2d()' applies a 2D convolution over input data composed of multiple planes (i.e., channels).


Convolution

1.1. C++ routine

C++ routines use templates to support user-specified data types.

template<class TYPE=float>
void Convolution2d
(           TYPE     *out_data    // out_channel x out_size x out_size
    , const TYPE     *in_data     // in_channel x in_size x in_size
    , const TYPE     *kernel      // out_channel x in_channel x kernel_size x kernel_size
    , const TYPE     *bias        // out_channel
    , const uint16_t  out_size    // only for square matrix
    , const uint16_t  in_size     // only for square matrix
    , const uint8_t   kernel_size // only for square matrix
    , const uint16_t  bias_size   // out_channel
    , const uint16_t  in_channel  // number of input channels
    , const uint16_t  out_channel // number of filters (kernels)
    , const uint8_t   stride
    , const uint8_t   padding=0
    #if !defined(__SYNTHESIS__)
    , const int       rigor=0   // check rigorously when 1
    , const int       verbose=0 // verbose level
    #endif
)
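
To make the argument layout above concrete, here is a minimal reference sketch of a direct 2D convolution over the same memory layouts (`out_channel x out_size x out_size`, and so on). It is an illustration under stated assumptions, not the DLR source: `conv2d_sketch` and `conv_out_size_sketch` are hypothetical names, the `bias_size`, `rigor`, and `verbose` arguments are dropped for brevity, padding is assumed to be zero padding, and the output-size relation is the usual one for a square convolution, which DLR is assumed (not confirmed) to follow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed output-size relation (floor division); hypothetical helper.
inline uint16_t conv_out_size_sketch(uint16_t in_size, uint8_t kernel_size,
                                     uint8_t stride, uint8_t padding)
{
    assert(stride > 0 && in_size + 2*padding >= kernel_size);
    return static_cast<uint16_t>((in_size + 2*padding - kernel_size)/stride + 1);
}

// Direct convolution; hypothetical name, not the DLR API.
template<class TYPE=float>
void conv2d_sketch
(           TYPE     *out_data    // out_channel x out_size x out_size
    , const TYPE     *in_data     // in_channel x in_size x in_size
    , const TYPE     *kernel      // out_channel x in_channel x kernel_size x kernel_size
    , const TYPE     *bias        // out_channel, or nullptr for no bias
    , const uint16_t  out_size
    , const uint16_t  in_size
    , const uint8_t   kernel_size
    , const uint16_t  in_channel
    , const uint16_t  out_channel
    , const uint8_t   stride
    , const uint8_t   padding=0)
{
    assert(out_size == conv_out_size_sketch(in_size, kernel_size, stride, padding));
    for (int oc = 0; oc < out_channel; oc++)
    for (int oy = 0; oy < out_size; oy++)
    for (int ox = 0; ox < out_size; ox++) {
        TYPE sum = (bias != nullptr) ? bias[oc] : TYPE(0);
        for (int ic = 0; ic < in_channel; ic++)
        for (int ky = 0; ky < kernel_size; ky++)
        for (int kx = 0; kx < kernel_size; kx++) {
            // Input coordinates; positions outside the image act as zeros.
            const int iy = oy*stride + ky - padding;
            const int ix = ox*stride + kx - padding;
            if (iy < 0 || iy >= in_size || ix < 0 || ix >= in_size) continue;
            const std::size_t in_idx =
                (static_cast<std::size_t>(ic)*in_size + iy)*in_size + ix;
            const std::size_t k_idx =
                ((static_cast<std::size_t>(oc)*in_channel + ic)*kernel_size + ky)
                *kernel_size + kx;
            sum += in_data[in_idx] * kernel[k_idx];
        }
        out_data[(static_cast<std::size_t>(oc)*out_size + oy)*out_size + ox] = sum;
    }
}
```

With this relation, a 32 x 32 input and a 5 x 5 kernel at stride 1 and padding 0 give a 28 x 28 output, so `out_size` has to be computed by the caller before the output buffer is allocated.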

1.2. C routine

C routines are built by instantiating the template of the corresponding C++ routine for a specific data type.

extern void Convolution2dInt
(           int      *out_data    // out_channel x out_size x out_size
    , const int      *in_data     // in_channel x in_size x in_size
    , const int      *kernel      // out_channel x in_channel x kernel_size x kernel_size
    , const int      *bias        // bias per kernel
    , const uint16_t  out_size    // only for square matrix
    , const uint16_t  in_size     // only for square matrix
    , const uint8_t   kernel_size // only for square matrix
    , const uint16_t  bias_size   // number of biases, it should be the same as out_channel
    , const uint16_t  in_channel  // number of input channels
    , const uint16_t  out_channel // number of filters (kernels)
    , const uint8_t   stride      // stride default 1
    , const uint8_t   padding     // padding default 0
    #if !defined(__SYNTHESIS__)
    , const int       rigor       // check rigorously when 1
    , const int       verbose     // verbose level
    #endif
);
extern void Convolution2dFloat( );  // arguments are the same as Convolution2dInt, but the data type is 'float' instead of 'int'
extern void Convolution2dDouble( ); // arguments are the same as Convolution2dInt, but the data type is 'double' instead of 'int'

1.3. Python wrapper

The ‘Convolution2d()’ wrapper takes NumPy arrays as its array arguments, and ‘out_data’ carries the calculated result. It returns ‘True’ on success or ‘False’ on failure.

Convolution2d( out_data    # out_channel x out_size x out_size
             , in_data     # in_channel x in_size x in_size
             , kernel      # out_channel x in_channel x kernel_size x kernel_size
             , bias=None   # out_channel
             , stride=1
             , padding=0
             , rigor=False
             , verbose=False)

1.4. PyTorch wrapper

The 'conv2d()' PyTorch wrapper takes PyTorch tensors as its array arguments and returns the calculated result. It calls the Python wrapper after converting the PyTorch tensors to NumPy arrays.

conv2d( input     # in_minibatch x in_channel x in_size x in_size
      , weight    # out_channel  x in_channel x kernel_size x kernel_size
      , bias=None # out_channel
      , stride=1
      , padding=0
      , dilation=1
      , groups=1
      , rigor=False
      , verbose=False)

2.4 Deconvolution

2.4.1 Deconvolution2d


Deconvolution (Transposed convolution)

2.5 Linear (Fully connected)

Linear transformation on the input data:

y = x * W^T + b
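
The equation above can be sketched for a one-dimensional input as a plain matrix-vector product. This is a minimal illustration, not the DLR routine: `linear_sketch` is a hypothetical name, and the weight layout `out_size x in_size` (row-major, as in PyTorch's `torch.nn.functional.linear`) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// y = x * W^T + b for a 1-D input; hypothetical name, not the DLR API.
template<class TYPE=float>
void linear_sketch
(           TYPE     *out_data  // out_size
    , const TYPE     *in_data   // in_size
    , const TYPE     *weight    // out_size x in_size (assumed layout)
    , const TYPE     *bias      // out_size, or nullptr for no bias
    , const uint16_t  in_size
    , const uint16_t  out_size)
{
    assert(out_data != nullptr && in_data != nullptr && weight != nullptr);
    for (int o = 0; o < out_size; o++) {
        TYPE sum = (bias != nullptr) ? bias[o] : TYPE(0);
        for (int i = 0; i < in_size; i++) {
            // Row 'o' of W multiplies the input vector.
            sum += in_data[i] * weight[static_cast<std::size_t>(o)*in_size + i];
        }
        out_data[o] = sum;
    }
}
```

For example, with x = (1, 2, 3), W = ((1, 0, -1), (2, 2, 2)) and b = (0.5, -1), the sketch produces y = (-1.5, 11).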


Linear one-dimensional


Linear N-dimensional

2.6 Normalization

2.6.1 Batch normalization


Batch normalization equation: y = (x - E[x]) / sqrt(Var[x] + ε) * γ + β
E[x]: mean, Var[x]: variance, γ: scaling factor, β: bias, ε: small constant for numerical stability

Batch normalization
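
Using the symbols defined above (E[x], Var[x], γ, β, ε), batch normalization at inference time reduces to a per-channel scale and shift. The sketch below is a minimal illustration under assumptions: `batch_norm2d_sketch` and its parameter names are hypothetical, the data layout `in_channel x in_size x in_size` is borrowed from the convolution section, and the mean and variance are taken to be precomputed per channel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta, applied per channel.
// Hypothetical name and argument list, not the DLR API.
template<class TYPE=float>
void batch_norm2d_sketch
(           TYPE     *out_data   // in_channel x in_size x in_size
    , const TYPE     *in_data    // in_channel x in_size x in_size
    , const TYPE     *mean       // E[x], one value per channel
    , const TYPE     *variance   // Var[x], one value per channel
    , const TYPE     *gamma      // scaling factor, one value per channel
    , const TYPE     *beta       // bias, one value per channel
    , const uint16_t  in_size
    , const uint16_t  in_channel
    , const TYPE      epsilon)
{
    assert(out_data != nullptr && in_data != nullptr);
    const std::size_t plane = static_cast<std::size_t>(in_size)*in_size;
    for (int c = 0; c < in_channel; c++) {
        // The divider depends only on the channel, so compute it once.
        const double inv_std =
            1.0 / std::sqrt(static_cast<double>(variance[c])
                            + static_cast<double>(epsilon));
        for (std::size_t i = 0; i < plane; i++) {
            const std::size_t idx = c*plane + i;
            const double centered = static_cast<double>(in_data[idx])
                                  - static_cast<double>(mean[c]);
            out_data[idx] = static_cast<TYPE>(centered*inv_std*gamma[c] + beta[c]);
        }
    }
}
```

Because `inv_std * gamma` and `beta - mean * inv_std * gamma` are constants per channel, a hardware version could precompute them and apply a single multiply-add per element; that folding is a common design choice and is not something the source states about DLR.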

2.7 Pooling

2.7.1 Average pooling

2.7.2 Max pooling
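
Average and max pooling differ only in how each window is reduced, which the following sketch makes explicit. It is an illustration under assumptions, not the DLR code: `pool2d_avg_sketch` and `pool2d_max_sketch` are hypothetical names, windows are square with no padding, and the output size is assumed to be `(in_size - kernel_size)/stride + 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Average pooling over square windows; hypothetical name, not the DLR API.
template<class TYPE=float>
void pool2d_avg_sketch(TYPE *out_data, const TYPE *in_data,
                       uint16_t out_size, uint16_t in_size,
                       uint8_t kernel_size, uint16_t channel, uint8_t stride)
{
    assert(stride > 0 && kernel_size > 0 && in_size >= kernel_size);
    assert(out_size == (in_size - kernel_size)/stride + 1);
    for (int c = 0; c < channel; c++)
    for (int oy = 0; oy < out_size; oy++)
    for (int ox = 0; ox < out_size; ox++) {
        TYPE sum = TYPE(0);
        for (int ky = 0; ky < kernel_size; ky++)
        for (int kx = 0; kx < kernel_size; kx++) {
            const std::size_t idx = (static_cast<std::size_t>(c)*in_size
                                     + oy*stride + ky)*in_size + ox*stride + kx;
            sum += in_data[idx];
        }
        // For integer TYPE this is an integer division.
        out_data[(static_cast<std::size_t>(c)*out_size + oy)*out_size + ox] =
            sum / static_cast<TYPE>(kernel_size*kernel_size);
    }
}

// Max pooling over square windows; hypothetical name, not the DLR API.
template<class TYPE=float>
void pool2d_max_sketch(TYPE *out_data, const TYPE *in_data,
                       uint16_t out_size, uint16_t in_size,
                       uint8_t kernel_size, uint16_t channel, uint8_t stride)
{
    assert(stride > 0 && kernel_size > 0 && in_size >= kernel_size);
    assert(out_size == (in_size - kernel_size)/stride + 1);
    for (int c = 0; c < channel; c++)
    for (int oy = 0; oy < out_size; oy++)
    for (int ox = 0; ox < out_size; ox++) {
        // Start from the top-left element of the window.
        TYPE best = in_data[(static_cast<std::size_t>(c)*in_size
                             + oy*stride)*in_size + ox*stride];
        for (int ky = 0; ky < kernel_size; ky++)
        for (int kx = 0; kx < kernel_size; kx++) {
            const std::size_t idx = (static_cast<std::size_t>(c)*in_size
                                     + oy*stride + ky)*in_size + ox*stride + kx;
            if (in_data[idx] > best) best = in_data[idx];
        }
        out_data[(static_cast<std::size_t>(c)*out_size + oy)*out_size + ox] = best;
    }
}
```

A 4 x 4 input pooled with a 2 x 2 window at stride 2 yields a 2 x 2 output, one value per non-overlapping window.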

3. Projects

3.1 Project: LeNet-5 for MNIST

LeNet-5 is a popular convolutional neural network architecture for handwritten and machine-printed character recognition.

Details of this project can be found in DLR Projects: LeNet-5.


LeNet-5 project on FPGA

1. LeNet-5 network

The following picture shows the network structure and its data sizes.


LeNet-5 network

The following table shows a summary of the network and its parameters.


LeNet-5 network details

2. Design flow

The picture below shows the overall design flow from model training to FPGA implementation.


LeNet-5 design flow

3.2 Project: Tiny YOLO-V2 for VOC

YOLO is a popular convolutional neural network architecture for object detection.

Details of this project can be found in DLR Projects: Tiny YOLO-V2.


Tiny YOLO-V2 project on FPGA

4. Other things

4.1 Acknowledgment

This work was supported in part by KARI (Korea Aerospace Research Institute) under the “Development of deep learning hardware accelerator for microsatellite” (Contract 2019110A35C-00).
Part of this work was carried out through the Handong Global University 2020 Summer Internship program.
Part of this work was supported by ETRI (Electronics and Telecommunications Research Institute) under the “Development and training of FPGA-based AI semiconductor design environment” (Contract EA20202206).

4.2 Authors and contributors

Ando Ki - Initial work - Future Design Systems
Seongwon Seo - Future Design Systems
Kwangsub Jeon - Future Design Systems
Chae Eon Lim - Future Design Systems

4.3 License

DLR (Deep Learning Routines) and its associated materials are licensed under the 2-clause BSD license so that the program and library can be used in open and closed source projects regardless of their licensing scheme. Each contributor holds copyright over their respective contribution.

The 2-Clause BSD License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

4.4 Revision history

2021.09.04: v1.4 uses namespace 'dlr'
2020.11.12: Released v1.3
2020.03.10: Started by Ando Ki (adki at future-ds.com)
