Charrin/RetinaFace-Cpp

About arm platform (ncnn model)

hanson-young opened this issue · 16 comments

Do you have any plans to port this to ncnn for the ARM platform? I failed to convert the Caffe model you provide to ncnn.

:~/Documents/3rdpart/ncnn/build/tools/caffe$ ./caffe2ncnn ./mnet.prototxt ./mnet.prototxt.caffemodel ./retina.param ./retina.bin
Segmentation fault (core dumped)

It is caused by the deconv layer having empty weights: Caffe will initialize new weights in that case, but ncnn will not.
I have fixed this problem; please update to the new mnet model.
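
For reference, a minimal sketch of that kind of fix (my assumption, not necessarily the author's actual procedure; it assumes a standard Caffe build, the file names from the caffe2ncnn command above, and that the deconv layer in mnet.prototxt declares a weight_filler): constructing a Net fills every learnable blob from its filler, CopyTrainedLayersFrom only overwrites the blobs that actually exist in the caffemodel, and saving writes the filled deconv weights back out so caffe2ncnn no longer sees empty blobs.

// fill_deconv_weights.cpp -- a sketch under the assumptions above.
#include <caffe/caffe.hpp>
#include <caffe/util/io.hpp>

int main() {
    // Building the net initializes all learnable blobs from their fillers.
    caffe::Net<float> net("mnet.prototxt", caffe::TEST);
    // Copies only the weights present in the caffemodel; the missing
    // deconv blobs keep their filler-initialized values.
    net.CopyTrainedLayersFrom("mnet.prototxt.caffemodel");
    // Serialize everything, including the now non-empty deconv weights.
    caffe::NetParameter param;
    net.ToProto(&param, /*write_diff=*/false);
    caffe::WriteProtoToBinaryFile(param, "mnet-filled.caffemodel");
    return 0;
}
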
Can you provide the test speed on the ARM platform? Thank you!

Thanks a lot, I will provide it!

@Charrin
I have got inference and post-processing working on ncnn! But multi-threading does not improve performance (a minimal inference sketch follows the benchmark numbers below).
https://github.com/hanson-young/RetinaFace-Cpp/blob/master/retinaface_ncnn/images/result.jpg

Qualcomm 835, VGA (640*480), inference only

130|greatqltechn:/data/local/tmp $ ./benchncnn 4 1 0 0                         
loop_count = 4
num_threads = 1
powersave = 0
gpu_device = 0
 retinaface-mnet0.25  min =  130.10  max =  131.61  avg =  130.94
       mobilefacenet  min =   48.79  max =   49.55  avg =   49.21
  mobilefacenet-int8  min =   47.21  max =   48.11  avg =   47.76
          squeezenet  min =   63.86  max =   65.61  avg =   64.63
     squeezenet-int8  min =   49.12  max =   49.65  avg =   49.36
           mobilenet  min =  110.70  max =  112.14  avg =  111.47
      mobilenet-int8  min =   88.56  max =   89.66  avg =   89.31
        mobilenet_v2  min =   80.85  max =   82.40  avg =   81.81
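
For anyone reproducing this, a minimal inference sketch (assuming the retina.param / retina.bin names from the caffe2ncnn command above; the blob names "data" and "face_rpn_cls_prob_reshape_stride32" are placeholders, check the .param file for the real ones):

// retinaface_ncnn_sketch.cpp -- a sketch under the assumptions above.
#include "net.h"   // ncnn
#include <vector>

int main() {
    ncnn::Net retina;
    retina.load_param("retina.param");
    retina.load_model("retina.bin");

    // A zeroed 640x480 BGR buffer stands in for a real VGA frame.
    std::vector<unsigned char> bgr(640 * 480 * 3, 0);
    ncnn::Mat in = ncnn::Mat::from_pixels(bgr.data(), ncnn::Mat::PIXEL_BGR, 640, 480);

    ncnn::Extractor ex = retina.create_extractor();
    ex.set_num_threads(4);   // the knob the multi-threading experiments tune
    ex.input("data", in);    // placeholder input blob name

    ncnn::Mat scores;
    ex.extract("face_rpn_cls_prob_reshape_stride32", scores);  // placeholder output blob
    return 0;
}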

Thank you! I have added your test results to my README.

nihui commented

@hanson-young hi, your work is very much appreciated!

The model graph is not optimal, I think, so you can try this ~
https://github.com/Tencent/ncnn/wiki/model-optimize
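
For example, with the ncnnoptimize tool from the ncnn build tree (the last flag selects the weight storage type; as far as I remember, 0 keeps fp32 and 65536 converts to fp16):

./ncnnoptimize retina.param retina.bin retina-opt.param retina-opt.bin 0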

I have tried it; it gives about a 10% speedup on a Qualcomm 625:
1-thread 379ms
2-thread 244ms
4-thread 180ms

@nihui Thanks a lot, nihui! The problem has been solved. CMake 3.9.2 has an issue with OpenMP when compiling with the NDK; downgrading to 3.5.1 fixed it. https://gitlab.kitware.com/cmake/cmake/issues/17351
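
For reference, the NDK build invocation in question is roughly the standard ncnn Android build (the NDK path and API level here are placeholders):

cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI="arm64-v8a" -DANDROID_PLATFORM=android-21 ..
make -j4
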
@Charrin Here are my test results:
Qualcomm 835, VGA (640*480)

greatqltechn:/data/local/tmp $ ./benchncnn 4 4 0                                                                                                                                                                  
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
 retinaface-mnet0.25  min =   62.31  max =   63.49  avg =   62.79
retinaface-mnet0.25_opt  min =   67.09  max =   82.76  avg =   75.99
       mobilefacenet  min =   15.89  max =   16.32  avg =   16.09
   mobilefacenet_opt  min =   14.12  max =   14.59  avg =   14.42
  mobilefacenet_int8  min =   16.11  max =   16.45  avg =   16.26
          squeezenet  min =   22.76  max =   26.53  avg =   23.94
     squeezenet_int8  min =   18.77  max =   19.20  avg =   18.99
           mobilenet  min =   34.43  max =   34.91  avg =   34.66
      mobilenet_int8  min =   28.90  max =   31.59  avg =   30.00
130|greatqltechn:/data/local/tmp $ ./benchncnn 4 2 0                                                                                                                                                              
loop_count = 4
num_threads = 2
powersave = 0
gpu_device = -1
 retinaface-mnet0.25  min =   82.75  max =   83.10  avg =   82.97
retinaface-mnet0.25_opt  min =   73.44  max =   75.41  avg =   74.52
       mobilefacenet  min =   28.08  max =   30.48  avg =   28.97
   mobilefacenet_opt  min =   25.23  max =   25.98  avg =   25.54
  mobilefacenet_int8  min =   29.37  max =   29.91  avg =   29.69
          squeezenet  min =   35.18  max =   38.03  avg =   36.80
     squeezenet_int8  min =   29.45  max =   31.90  avg =   30.67
           mobilenet  min =   58.60  max =   59.68  avg =   59.17
      mobilenet_int8  min =   51.27  max =   52.94  avg =   51.73
130|greatqltechn:/data/local/tmp $ ./benchncnn 4 1 0                                                                                                                                                              
loop_count = 4
num_threads = 1
powersave = 0
gpu_device = -1
 retinaface-mnet0.25  min =  136.17  max =  138.68  avg =  137.37
retinaface-mnet0.25_opt  min =  123.71  max =  127.71  avg =  125.10
       mobilefacenet  min =   51.50  max =   53.77  avg =   52.40
   mobilefacenet_opt  min =   46.99  max =   47.81  avg =   47.52
  mobilefacenet_int8  min =   56.54  max =   58.16  avg =   57.55
          squeezenet  min =   64.10  max =   65.19  avg =   64.77
     squeezenet_int8  min =   51.01  max =   51.62  avg =   51.42
           mobilenet  min =  107.86  max =  111.64  avg =  109.71
      mobilenet_int8  min =   98.07  max =   98.55  avg =   98.30

@hanson-young have you compared the ARM inference speed of RetinaFace vs. the MTCNN model?

@pineking It's hard to say; it depends on the specific platform and use case.

@hanson-young my test time on the 835 is about 20 ms slower than your inference time. Could you share your ncnn lib and include files for Android? Thank you very much!

@hanjw123 I compiled it on May 29, but you can get the ncnn lib from here: https://github.com/Tencent/ncnn/releases.

@hanson-young OK! I tried an older version and it is indeed faster. Thank you very much!

@hanson-young my inference result is wrong. What's your NDK version and ANDROID_PLATFORM version?

@hanjw123 Hi, I also tested the speed of the RetinaFace model. Would you like to discuss this together?
My WeChat is pineking.

@pineking I run it on ARM aarch64, not as an Android application.

I tested the speed of the Caffe mnet model on a Raspberry Pi 4B, using Alibaba's MNN as the inference framework. The Raspberry Pi 4B CPU is a BCM2711 (quad-core Cortex-A72 @ 1.5 GHz), the test resolution is VGA (640*480), and the times are averaged over 10 loops:

Cores   fp32 time (ms)   quantized int8 time (ms)
1       167              183
2       116              102
3       105               76
4        96               61
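
For context, the core count in MNN is set per session; a minimal sketch (assuming the Caffe model was converted to a hypothetical mnet.mnn with MNNConvert):

// mnn_threads_sketch.cpp -- a sketch; mnet.mnn is a placeholder for the
// MNNConvert output, and numThread = 4 matches the last row of the table.
#include <MNN/Interpreter.hpp>
#include <memory>

int main() {
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("mnet.mnn"));

    MNN::ScheduleConfig config;
    config.numThread = 4;   // number of cores used for inference
    MNN::Session* session = net->createSession(config);

    // Tensors are looked up by name; nullptr picks the first input/output.
    MNN::Tensor* input = net->getSessionInput(session, nullptr);
    (void)input;  // fill with a preprocessed VGA frame in real use
    net->runSession(session);
    MNN::Tensor* output = net->getSessionOutput(session, nullptr);
    (void)output;
    return 0;
}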