TinyMaix

中文 | English

TinyMaix is a tiny inference Neural Network library specifically for microcontrollers (TinyML).
We design it follow the rule: Easy-to-Use > Portable > Speed > Space

Introduction to tinyML: TinyML
See tested 47 chips and benchmark: benchmark
Good News: Rewarded Porting TinyMaix

Highlights

Core Code less than 400 lines(tm_layers.c+tm_model.c+arch_cpu.h), code .text section less than 3KB
Low ram consume, even Arduino ATmega328 (32KB Flash, 2KB Ram) can run mnist with TinyMaix~
Support INT8/FP32/FP16 model, experimentally support FP8, convert from keras h5 or tflite.
Supoort multi architecture accelerate: ARM SIMD/NEON/MVEI，RV32P, RV64V, CSKYV2, X86 SSE2 ~
User-friendly interfaces, just load/run models~
Support Full Static Memory config
MaixHub Online Model Training support

Run mnist demo on Arduino ATmega328

mnist demo
0000000000000000000000000000
0000000000000000000000000000
0000000000000000000000000000
000000000077AFF9500000000000
000000000AFFFFFFD10000000000
00000000AFFFD8BFF70000000000
00000003FFD2000CF80000000000
00000004FD10007FF40000000000
00000000110000DFF40000000000
00000000000007FFC00000000000
0000000000004FFE300000000000
0000000000008FF9000000000000
00000000000BFF90000000000000
00000000001EFE20000000000000
0000000000CFF800000000000000
0000000004FFB000000000000000
000000001CFF8000000000000000
000000008FFA0000000000000000
00000000FFF10000000000000000
00000000FFF21111000112999900
00000000FFFFFFFFA8AFFFFFFF70
00000000AFFFFFFFFFFFFFFA7730
0000000007777AFFF97720000000
0000000000000000000000000000
0000000000000000000000000000
0000000000000000000000000000
0000000000000000000000000000
0000000000000000000000000000
===use 49912us
0: 0
1: 0
2: 89
3: 0
4: 1
5: 6
6: 1
7: 0
8: 0
9: 0
### Predict output is: Number 2, prob=89

TODO List

~~optimize tm_layers.c to tm_layers_O1.c, aimed to speed up to 1.4~2.0X~~ Done
~~Add "ADD" OPS to support resnet/mbnet v2 ~~ Done!
Train good backbone for 64KB/128KB/256KB/512KB ram litmit
Add example: Detector,KWS,HAR,Gesture,OCR,...
...

Do you want take participate in development of TinyMaix, or discuss with TinyML hobbyist?
Join our telegram group: https://t.me/tinymaix

TinyMaix Design

TinyMaix is design for running AI Neural Network Mdoels on resources limited MCUs, which usually called TinyML

There are many TinyML infer library now, like TFLite micro, microTVM, NNoM, so why we need TinyMaix?

TinyMaix is a weekend hackathons project, so it is simple enough to read though in 30 minutes, and it will help TinyML newbies to understand how is it running.

TinyMaix aims to be a simple TinyML infererence library, it abandon many new features and doesn't use libs like CMSIS-NN.

Following this design goal, now TinyMaix is as simple as 5 files to compile~

We hope TinyMaix can help any MCU run AI Neural Network Mdoels, every one can port it to theirself hardware platform~

Note: Although TinyMaix support multi architecture accelerate, but it still need more effort to balance size and speed.

Features in design

Support up to mobilenet v1, RepVGG backbone
- they are most common used, efficient structure for MCUs
- Basic Conv2d, dwConv2d, FC, Relu/Relu6/Softmax, GAP, Reshape
- ADD (for resnet,mbnet v2 struct, experimental)
- MaxPool, AvgPool (now use stride instead)
FP32 model, INT8 quant model, FP16 model(NEW)
Convert tmdl from keras h5 or tflite
- model is simple enough to train with keras/tf
- tflite have quant functions already
Model statistics functions in C
- Optional for reduce code

Features maybe added

INT16 quant model
- Advantages:
  - more accuracy
  - friendly for SIMD/RV32P accelerate
- Disadvantages:
  - increase FLASH/RAM consume 2X
Concat OP
- Advantages:
  - support mobilenet v2, more accuracy
- Disadvantages:
  - increase RAM consume 2X
  - concat mat cost many time cause model infer slow
  - need more work to cvt model into flat structure (in script)
Winograd Convolution Optimization
- Advantages:
  - may speed up Conv computing
- Disadvantages:
  - increase RAM consume, and consume more memory bandwidth
  - increase code (.text) size
  - need many Transforms, weak MCU may cost many time here

Features won't be added

BF16 model
- most MCU don't have BF16 computing ability
- accuracy won't better than INT16 to much
- increase FLASH/RAM consume 2X
AVX/vulkan acceleration
- TinyMaix is for MCUs, not for powerful PC/mobilephones
other misc OPs
- TinyMaix support MCUs to run basic model in minimum resource consumption, if you want more OPs, switch to TFlite-micro/TVM/NCNN...

Try Demos

mnist

MNIST is handwritten digit recognition task, it is simple enough for even 8bit MCU like ATmega328.
Try it on PC:

cd examples/mnist
mkdir build
cd build 
cmake ..
make
./mnist

mbnet

mbnet (mobilenet v1) is simple classification model for mobile devices, but it is still a little heavy for MCUs.
The model in demo is mobilenet v1 0.25, it input 128x128x3 RGB image, output 1000 classes predict.
It need at least 128KB SRAM and 512KB Flash, STM32F411 is the typical minimum config for this model.

Try run mobilenet

cd examples/mbnet
mkdir build
cd build 
cmake ..
make
./mbnet

How to use (API)

Load Model

tm_err_t tm_load (tm_mdl_t* mdl, const uint8_t* bin, uint8_tbuf, tm_cb_t cb, tm_mat_t in);

mdl: model handle;
bin: model bin buf;
buf: main buf for middle output; if NULL, auto malloc main buf; else, use your static buffer.
cb: layer callback;
in: return input mat, include buf addr; //you can ignore it if use static buf

Remove Model

void tm_unload(tm_mdl_t* mdl);

Preprocess Input Data

tm_err_t tm_preprocess(tm_mdl_t* mdl, tm_pp_t pp_type, tm_mat_t* in, tm_mat_t* out);
TMPP_FP2INT //user own fp buf -> int input buf
TMPP_UINT2INT //int8: cvt in place; int16: can't cvt in place
TMPP_UINT2FP01 // u8/255.0
TMPP_UINT2FPN11// (u8-128)/128

Run Model

tm_err_t tm_run (tm_mdl_t* mdl, tm_mat_t* in, tm_mat_t* out);

How to port

The core file is those 5 files: tm_model.c, tm_layers.c, tinymaix.h, tm_port.h, arch_xxx.h

If you are using normal mcu without any acceleration instructions, choose arch_cpu.h, otherwise choose corresponding architecture header.

And you should edit tm_port.h to fill your desired configs, all config macro have annotation follow it.

Note TM_MAX_CSIZE,TM_MAX_KSIZE,TM_MAX_KCSIZE will occupy static buffers.

And now just put them into your project, compile it~

How to train/convert models

There are training scripts in examples/mnist to learn how to train simple mnist models.

Note: you need install TensorFlow (>=2.7) first.

After training and save h5 models, you can use scripts in tools to convert to tmdl or c header files.

h5_to_tflite.py
convert h5 model to float or int8 quant tflite files
python3 h5_to_tflite.py h5/mnist.h5 tflite/mnist_f.tflite 0
python3 h5_to_tflite.py h5/mnist.h5 tflite/mnist_q.tflite 1 quant_img_mnist/ 0to1
tflite2tmdl.py
convert tflite file to tmdl or c header files.
python3 tflite2tmdl.py tflite/mnist_q.tflite tmdl/mnist_q.tmdl int8 1 28,28,1 10

================ pack model head ================
mdl_type   =0
out_deq    =1
input_cnt  =1
output_cnt =1
layer_cnt  =6
buf_size   =1464
sub_size   =0
in_dims    = [3, 28, 28, 1]
out_dims   = [1, 1, 1, 10]
================   pack layers   ================
CONV_2D
    [3, 28, 28, 1] [3, 13, 13, 4]
    in_oft:0, size:784;  out_oft:784, size:680
    padding valid
    layer_size=152
CONV_2D
    [3, 13, 13, 4] [3, 6, 6, 8]
    in_oft:784, size:680;  out_oft:0, size:288
    padding valid
    layer_size=432
CONV_2D
    [3, 6, 6, 8] [3, 2, 2, 16]
    in_oft:0, size:288;  out_oft:1400, size:64
    padding valid
    layer_size=1360
MEAN
    [3, 2, 2, 16] [1, 1, 1, 16]
    in_oft:1400, size:64;  out_oft:0, size:16
    layer_size=48
FULLY_CONNECTED
    [1, 1, 1, 16] [1, 1, 1, 10]
    in_oft:0, size:16;  out_oft:1448, size:16
    layer_size=304
SOFTMAX
    [1, 1, 1, 10] [1, 1, 1, 10]
    OUTPUT!
    in_oft:1448, size:16;  out_oft:0, size:56
    layer_size=48
================    pack done!   ================
    model  size 2.4KB (2408 B) FLASH
    buffer size 1.4KB (1464 B) RAM
    single layer mode subbuff size 1.4KB (64+1360=1424 B) RAM
Saved to tmdl/mnist_q.tmdl, tmdl/mnist_q.h

Now you have tmdl or c header files, put it into your project to use it~

How to train models online with MaixHub

You can download models from MaixHub or train your AI models online easily with MaixHub, don't need AI knowledge, train your model just click your mouse.

Register MaixHub account and login.
You can download TinyMaix models from model zoo or upload your models to model zoo for sharing.
Create a train project, collect dataset and train models online, finally you will get files:
- .tmdl file and .h file, use one of them in your code.
- report.json, report info, json format, we can find labels or anchors in this file, we will use these params in our code. Attention, these params will change in every training, you should copy these params to your code when change model, or you will the result will be wrong.

There's two type: classification and detection, for first time usage, use classification is recommended.

There's many backbone, you should select proper backbone according to your MCU's RAM size, the smaller RAM size, should choose the smaller backbone.

For easier understanding how MaixHub works, at first time you can choose tfjs platform instead of tinymaix to run model on your mobile phone.

Find demo in examples folder, use the maixhub_image_classification demo or maixhub_image_detection demo to run your model.

How to add new platform acceleration code

For new platforms, you just need add arch_xxx.h to src dir, and implement functions inside.
Here is the main functions you need implement (sort by importance):

a. TM_INLINE void tm_dot_prod(mtype_t* sptr, mtype_t* kptr,uint32_t size, sumtype_t* result)
	implement platform's dot product functions, usually use MAC related instructions. 

b. TM_INLINE void tm_dot_prod_pack2(mtype_t* sptr, mtype_t* kptr, uint32_t size, sumtype_t* result)
	implement platform's dual channel dot product functions  
  (not 4 or more channel, because some chip platform's register is not enough to support more channels)

c. TM_INLINE void tm_postprocess_sum(int n, sumtype_t* sums, btype_t* bs, int act, mtype_t* outp, sctype_t* scales, sctype_t out_s, zptype_t out_zp)
	implement platform's batch postprocess functions, note n is power of 2.

d. TM_INLINE void tm_dot_prod_3x3x1(mtype_t* sptr, mtype_t* kptr, sumtype_t* result)
	implement platform 3x3 dot product. mostly use handwrite cpu code.
  
e. TM_INLINE void tm_dot_prod_gap_3x3x1(mtype_t* sptr, mtype_t* kptr, uint32_t* k_oft, sumtype_t* result)
	implement platform 3x3 gap dot product.

...

Contribution & Contacts

If you want contribute functions to TinyMaix, please read "TinyMaix Design" sections, we only want functions in "Features in design" and "Features maybe added".

If you want commit your port test result, please commit to benchmark.md. You are welcome to port TinyMaix to your chip/boards, it will prove how easy to use TinyMaix run Deeplearning model in MCUs~

If you have question with TinyMaix usage/porting, please feedback Issues in this repo.

If you have bussiness project consulting or private questions, you can send mail to support@sipeed.com or zepan@sipeed.com (Caesar Wu).

Gu-Yi8023/TinyMaix