MixNet (Mix_Conv) - 0.360 (0.5) BFlops - 77.0% (71.5%) Top1
CuongNguyen218 opened this issue · 51 comments
MixNet-L and -M have the same network architecture: we simply apply depth_multiplier 1.3 on MixNet-M to get MixNet-L, as shown in this code: https://github.com/tensorflow/tpu/blob/56e1058cba2b7b5ca233a4c9bfd7331a69082188/models/official/mnasnet/mixnet/mixnet_builder.py#L217
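For context, the depth multiplier is applied per block to the channel counts. A minimal Python sketch of the usual rounding rule from the MnasNet/EfficientNet reference code is below; the round-to-a-multiple-of-8 rule and the example widths are assumptions for illustration, not taken from this thread:

```python
def round_filters(filters, multiplier, divisor=8):
    """Scale a channel count by `multiplier` and round to a multiple of `divisor`
    (assumed MnasNet/EfficientNet convention)."""
    filters *= multiplier
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    # never round down by more than ~10%
    if new_filters < 0.9 * filters:
        new_filters += divisor
    return int(new_filters)

# hypothetical block widths scaled by depth_multiplier = 1.3
for c in (24, 40, 80, 120, 200):
    print(c, "->", round_filters(c, 1.3))
```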
Trained:
- MixNet-M-GPU - 12.0M params - 0.532 BFlops-FMA - 77.0% (71.5%) Top1 - 93.3% (90.5%) Top5
- cfg-file: mixnet_m_gpu.cfg.txt
- weights-file: https://drive.google.com/open?id=1SOLd3eXHwcLkvwFgdiui6uL3-_rWWB1E
- Original MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA)
- GhostNet-1.0 - 5.0M params - 0.119 BFlops - xx.x% Top1 - xx.x% Top5 - MY URL
- GhostNet-1.0 - 5.2M params - 0.141 BFlops - 73.9% Top1 - 91.4% Top5 - Official
- MobileNetV3 - 5.4M params - 0.219 BFlops - 75.2% Top1 - --- Top5
- GhostNet-1.3 - 7.3M params - 0.226 BFlops - 75.7% Top1 - 92.7% Top5 - Official
- MixNet-S - 4.1M params - 0.256 BFlops - 75.8% Top1 - 92.8% Top5
- MixNet-M - 5.0M params - 0.360 BFlops - 77.0% (71.5%) Top1 - 93.3% ( 90.5%) Top5 - #4203
- MixNet-L - 7.3M params - 0.565 BFlops - 78.9% Top1 - 94.2% Top5
- EfficientNetB0 - 4.9M params - 0.450 BFlops - 76.3% (71.3%) Top1 - 93.2% (90.4%) Top5 - MY URL
- EfficientNetB0 - 5.3M params - 0.390 BFlops - 76.3% (70.0%) Top1 - 93.2% (88.9%) Top5 - Official
- EfficientNetB1 - 7.8M params - 0.700 BFlops - 78.8% Top1 - 94.4% Top5 #3380
- ShuffleNetV2 - xxxx params - 0.600 BFlops - 75.4% Top1 - xxxx Top5 #3750
- Darknet53 - 20.0M params - 18.5 BFlops - 77.2% Top1 - 93.8% Top5
https://github.com/pjreddie/darknet/blob/master/cfg/darknet53.cfg and https://pjreddie.com/darknet/imagenet/
Explanation:
- MixNet-M-GPU is a slightly optimized version of MixNet-M for GPU: it has higher BFlops but is also faster on GPU.
- MixNet-M achieves 77.0% Top1 and EfficientNetB0 achieves 76.3% Top1 only when they are trained with a large mini_batch_size on a large cluster (a ~$400k DGX-2 or a ~$1M GPU/TPU cluster); otherwise the official EfficientNetB0 achieves only 70.0% Top1, which is lower than our EfficientNetB0 at 71.3% Top1: https://github.com/WongKinYiu/CrossStagePartialNetworks#small-models (for example, GhostNet-1.0 should be trained with batch-norm synchronization on 8 GPUs with mini_batch_size=1024). To achieve 77.0% Top1 with MixNet-M, use Darknet GPU-processing on CPU-RAM: #4386
- MixNet-M-GPU has 0.532 BFlops, while Darknet shows 1.065 BFlops, i.e. 2x more. In all papers BFlops is actually FMA_BFlops (1 FMA = 2 operations: MUL + ADD): https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation (see the sketch at the end of this comment)
- Why do these models have a low number of BFLOPs but still low speed? In these models the low BFLOP count is achieved by using grouped/depthwise convolution, which is very slow on GPU, TPU-Edge and other devices.
We replace one of the 15 layers with either (1) vanilla DepthwiseConv9x9 with kernel size 9x9; or (2) MixConv3579 with 4 groups of kernels: {3x3, 5x5, 7x7, 9x9}.
As shown in the figure, large kernel size has a different impact on different layers: for most layers the accuracy doesn't change much, but for certain layers with stride 2 a larger kernel can significantly improve the accuracy. Notably, although MixConv3579 uses only half the parameters and FLOPS of the vanilla DepthwiseConv9x9, our MixConv achieves similar or slightly better performance for most of the layers.
Depthwise convolution is becoming increasingly popular in modern efficient ConvNets, but its kernel size is often overlooked. In this paper, we systematically study the impact of different kernel sizes, and observe that combining the benefits of multiple kernel sizes can lead to better accuracy and efficiency.
For comparison with EfficientNet, see the list above.
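To make the FMA bookkeeping and the MixConv parameter claim above concrete, here is a small plain-Python sketch (my own illustration, not code from this repo; the 96-channel layer is a hypothetical example):

```python
# Papers count one fused multiply-add (FMA) as a single op, while Darknet
# counts MUL and ADD separately, so Darknet BFLOPs ~= 2 x FMA BFLOPs.
darknet_bflops = 1.065
print(f"{darknet_bflops} Darknet BFLOPs ~= {darknet_bflops / 2:.3f} FMA BFLOPs")  # ~0.532

# Parameter check for the quoted MixConv result: depthwise 9x9 over C channels
# vs MixConv3579 (C/4 channels each getting a 3x3, 5x5, 7x7 or 9x9 depthwise kernel).
c = 96                                             # hypothetical channel count
dw9x9_params = c * 9 * 9
mixconv_params = (c // 4) * (3*3 + 5*5 + 7*7 + 9*9)
print(f"DepthwiseConv9x9: {dw9x9_params} params, MixConv3579: {mixconv_params} params")
# -> 7776 vs 3936: MixConv3579 needs roughly half the parameters (and FLOPs,
#    since per-position FLOPs scale with the same kernel areas).
```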
@AlexeyAB
As I understand it, the input tensor is split by the number of filters in Mix_Conv. As I see in the cfg above, I think you assume that the input has 16 channels and split it into 4 tensors with 4 channels each, right? But I can't understand why you used route layers with -2, -4, -6. Can you ensure that the input of each conv layer follows the order [0:3] for 3x3, [4:8] for 5x5 and so on?
@CuongNguyen218 thanks for sharing this.
And yeah, it seems like AlexeyAB's cfg applies the filters to the entire input tensor (like InceptionNet).
Maybe a channel-slice implementation should be used instead, @AlexeyAB
- Original MixNet uses 4 depthwise conv-layers (3x3, 5x5, 7x7, 9x9) instead of 1 depthwise conv-layer
- Yes, we can try to implement a channel_slice layer as in ShuffleNetV2: #3750
- Or it may be easier to improve the [route] layer:
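# branch 0: take channel group 0 of 4 from the input layer (-1), then a grouped 3x3 conv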
[route]
layers = -1
group_id=0
groups=4
[convolutional]
batch_normalize=1
filters=4
groups=4
size=3
stride=2
pad=1
activation=leaky
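# branch 1: take channel group 1 of 4 from the same input layer (-3 points back past branch 0), then a grouped 5x5 conv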
[route]
layers = -3
group_id=1
groups=4
[convolutional]
batch_normalize=1
filters=4
groups=4
size=5
stride=2
pad=1
activation=leaky
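# branch 2: channel group 2 of 4 from the same input layer (-5), then a grouped 7x7 conv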
[route]
layers = -5
group_id=2
groups=4
[convolutional]
batch_normalize=1
filters=4
groups=4
size=7
stride=2
pad=1
activation=leaky
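# branch 3: channel group 3 of 4 from the same input layer (-7), then a grouped 9x9 conv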
[route]
layers = -7
group_id=3
groups=4
[convolutional]
batch_normalize=1
filters=4
groups=4
size=9
stride=2
pad=1
activation=leaky
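# concatenate the outputs of the four branches (the 3x3, 5x5, 7x7 and 9x9 convs) back into one tensor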
[route]
layers = -1,-3,-5,-7
I added groups= and group_id= params to the [route] layer, so you can try to implement MixNet by using such blocks: #4203 (comment)
But I didn't test it.
Commit: 0fa9c8f
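Regarding the earlier question about channel order, here is a small numpy sketch of what the block above is intended to compute along the channel axis, assuming [route] with groups= and group_id= returns the group_id-th contiguous chunk of channels (that is the intent of the new params, not verified code from the commit). Under that assumption the 3x3 branch always sees channels 0-3, the 5x5 branch channels 4-7, and so on, and the final [route] concatenates the branch outputs in that fixed order:

```python
import numpy as np

# toy input: batch=1, 16 channels, 8x8 spatial, as in the cfg sketch above
x = np.arange(1 * 16 * 8 * 8, dtype=np.float32).reshape(1, 16, 8, 8)

def route_group(t, groups, group_id):
    """Assumed semantics of [route] with groups=/group_id=:
    split the channel axis into `groups` equal contiguous chunks
    and return chunk number `group_id`."""
    c = t.shape[1] // groups
    return t[:, group_id * c:(group_id + 1) * c]

branches = []
for gid, k in enumerate((3, 5, 7, 9)):
    chunk = route_group(x, groups=4, group_id=gid)   # channels [4*gid : 4*gid+4)
    # a real MixConv would apply a depthwise k x k conv here; we only track the slice
    branches.append(chunk)
    print(f"{k}x{k} branch gets channels {4*gid}..{4*gid+3}")

# the final [route] layers=-1,-3,-5,-7 concatenates the branch outputs
y = np.concatenate(branches, axis=1)
assert y.shape == x.shape
```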
@AlexeyAB, how can I verify that it is correct?
@AlexeyAB
Since it uses depthwise convolutions, it is better to run it on CPU.
This must be converted to OpenVINO. We have to think about operator fusion.
@CuongNguyen218 @dexception @beHappy666 @gnefihs @WongKinYiu @LukeAI
I implemented the MixNet-M classification network, so you can try to train it on ImageNet.
It seems it can be fast only on CPU.
GPU: NVIDIA RTX 2070
- MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference
- MixNet-M-XNOR (partially BIT-1 inference): mixnet_m_xnor.cfg.txt - 0.237 BFlops (0.118 FMA) - 5.3 sec per iteration training - 45ms inference (32 BIT-1 ops = 1 Flops)
- MixNet-M-GPU (minor modification for GPU): mixnet_m_gpu.cfg.txt - 1.0 BFlops (0.500 FMA) - 2.7 sec per iteration training - 45 ms inference
@AlexeyAB Hello,
- MixNet-S - 4.1M params - 0.256 BFlops - 75.8% Top1 - 92.8% Top5
- MixNet-M - 5.0M params - 0.360 BFlops - 77.0% Top1 - 93.3% Top5
- MixNet-M - 0.256 BFlops - 4.6 sec per iteration training - 45ms inference
- MixNet-M-GPU (minor modification for GPU) - 1.0 BFlops - 2.7 sec per iteration training - 45 ms inference
I'd like to know what the difference between these two comments is, thanks.
I'd like to know what the difference between these two comments is, thanks.
The 1st is from the paper.
The 2nd is the actual implementation.
Or what do you mean?
MixNet is just a more efficient (Top1/Flops) modification of EfficientNet.
Just to make sure I understand correctly:
the implemented MixNet-M is 0.256 BFLOPs, but the GPU version is 1.0 BFLOPs,
and the BFLOPs of the implemented MixNet-M are the same as MixNet-S in the paper.
I'll take a look at the cfg files after I finish my breakfast, thank you.
Yes, I just made some changes in MixNet-M (mixnet_m_gpu.cfg.txt) so it can be trained ~2x faster - 2.7 sec instead of 4.6 sec per training iteration with the same inference speed on GPU.
I just decreased groups= in depthwise-MixConv-layers, so it should be more accurate and faster on GPU.
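A rough illustration of that trade-off (my own sketch with a hypothetical layer shape, not numbers from the cfg): the parameters and MACs of a grouped convolution scale as 1/groups, so decreasing groups= raises BFlops, but the larger per-group matrix multiplies are usually a much better fit for GPU kernels, which is why the wall-clock iteration time drops.

```python
def grouped_conv_cost(h, w, c_in, c_out, k, groups):
    """Parameters and MACs of a k x k convolution with `groups` groups."""
    params = k * k * (c_in // groups) * c_out
    macs = h * w * params
    return params, macs

# hypothetical MixConv-style layer: 120 channels on a 14x14 map, 5x5 kernel
for g in (120, 24, 8, 1):   # 120 = fully depthwise, 1 = dense conv
    p, m = grouped_conv_cost(14, 14, 120, 120, 5, g)
    print(f"groups={g:3d}: {p:9,d} params, {m/1e6:7.2f} MMACs")
```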
Maybe we should look at Diagonalwise Refactorization (15x speedup of Depthwise Convolutions)
to speed up EfficientNet and MixNet: #3908
Now training mixnet_m.cfg.txt - 0.256 BFlops - 4.6 sec per iteration training - 45ms inference.
But it shows: Total BFLOPS 0.759.
Update: I get cuDNN Error: CUDNN_STATUS_INTERNAL_ERROR
@WongKinYiu Yes, I fixed it. BFLOPS 0.759 is 0.379 FMA (the EfficientNet and MixNet authors use FMA).
I successfully trained mixnet_m_gpu.cfg.txt for 10 000 iterations on Windows 7 x64.
@AlexeyAB thanks,
I don't know why training models with grouped convolution crashes on every one of my Windows computers.
On Ubuntu, everything works.
- How many iterations did you train before this error occurred?
- Can you show a screenshot of this error?
- Try to increase subdivisions.
- What CUDA and cuDNN versions do you use?
- Show the output of nvcc --version and nvidia-smi
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018
Cuda compilation tools, release 10.0, V10.0.130
Windows does not have nvidia-smi.
This is a very strange error: why is it trying to create another cuDNN handle when one has already been created?
Windows does not have nvidia-smi.
It should be in C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi
nvidia-smi.zip
Do you use the latest version of Darknet?
If you set subdivisions=8, does it help?
- Your nvcc --version shows CUDA 10.0, while nvidia-smi shows CUDA 10.1; maybe this is the reason.
- Also, some users have encountered errors when using CUDA 10.1.
Yes, I noticed that nvidia-smi shows CUDA version 10.1.
It is really strange.
When I installed CUDA, CUDA 10.1 had not been released yet.
Or just try to use a newer cuDNN version.
Yes, I noticed that nvidia-smi shows CUDA version 10.1.
It is really strange.
When I installed CUDA, CUDA 10.1 had not been released yet.
nvidia-smi reports the maximum CUDA version compatible with the driver, not the CUDA version installed locally. Yes, it's confusing.
@LukeAI thanks.
@AlexeyAB Hello,
- MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference
- MixNet-M-XNOR (partially BIT-1 inference): mixnet_m_xnor.cfg.txt - 0.237 BFlops (0.118 FMA) - 5.3 sec per iteration training - 45ms inference (32 BIT-1 ops = 1 Flops)
- MixNet-M-GPU (minor modification for GPU): mixnet_m_gpu.cfg.txt - 1.0 BFlops (0.500 FMA) - 2.7 sec per iteration training - 45 ms inference
Training these three models takes too much time currently...
- MixNet-M: epoch ~70,000, loss ~3.9.
- MixNet-M-XNOR: epoch ~50,000, loss ~5.5.
- MixNet-M-GPU: epoch ~100,000, loss ~3.6.
I will continue training MixNet-M-GPU and stop training the other two models.
What Top1/Top5 accuracy can you get for these 3 models currently?
- MixNet-M: epoch ~70,000, loss ~3.9, top-1 =21.96%, top-5 =45.85%.
- MixNet-M-XNOR: epoch ~50,000, loss ~5.5, top-1 =0.1%, top-5 =0.5%.
- MixNet-M-GPU: epoch ~100,000, loss ~3.6, top-1 =24.49%, top-5 =49.14%.
@WongKinYiu Thanks, it seems that xnor=1 isn't suitable for the places where I added it in the MixNet-M-XNOR model.
@WongKinYiu Hi,
I will continue training MixNet-M-GPU and stop training the other two models.
MixNet-M-GPU: epoch ~100,000, loss ~3.6, top-1 =24.49%, top-5 =49.14%.
what result did you get?
@AlexeyAB Hello,
Currently 320k epochs, loss ~= 2.5.
Currently 413k epochs, loss ~= 2.0, top-1 ~= 53.9%, top-5 ~= 79.1%.
Currently 480k epochs, loss ~= 1.7, top-1 ~= 60.4%, top-5 ~= 84.3%.
Currently 546k epochs, loss ~= 1.4, top-1 ~= 65.7%, top-5 ~= 87.1%.
Currently 618k epochs, loss ~= 1.2, top-1 ~= 69.4%, top-5 ~= 89.3%.
Currently 681k epochs, loss ~= 1.0, top-1 ~= 71.0%, top-5 ~= 90.3%.
Currently 736k epochs, loss ~= 1.0, top-1 ~= 71.4%, top-5 ~= 90.5%.
Finished: top-1 ~= 71.5%, top-5 ~= 90.5%.
@WongKinYiu After MixNet-M-GPU is trained on ImageNet, I will try to implement EfficientDet - BiFPN (trainable fusion layer, shared weights in the last 2-3 conv-layers before the yolo-layer, ...) with a MixNet-M-GPU backbone: #4346
With [Gaussian_yolo] layer + CIoU-loss.
@AlexeyAB Thanks,
I will train DIoU and CIoU.
I sent you an invitation to my private repo, so you can see the updated results there.
@WongKinYiu,
Can you give me a link to the CIoU and DIoU papers?
Here you are: #4360
@AlexeyAB ,
Did you provide an EfficientNet model, or did you convert an EfficientNet model pretrained on ImageNet to Darknet?
ImageNet and COCO models of EfficientNet-B0: #3874 (comment)
@AlexeyAB , what result did you get?
mixnet-m-gpu, top-1 = 71.5%, top-5 = 90.5%.
@WongKinYiu Nice! Can you share weights-file?
Why are your results very different from the paper?
Because mixnet-m-gpu was designed by @AlexeyAB; it does not appear in the paper.
Why are your results very different from the paper?
Because in the paper MixNet and EfficientNet are trained with a very large mini_batch_size on a ~$400k DGX-2 or a ~$1M cluster.
You can achieve the same 77.0% Top1 accuracy by using Darknet with #4386
If we train with the same mini_batch_size, then EfficientNet-B0 (official) has even lower Top1/5 accuracy than my EfficientNet-B0: https://github.com/WongKinYiu/CrossStagePartialNetworks#small-models
Also, I slightly optimized MixNet on GPU so that it can be trained in 1 month instead of 2 months.
@CuongNguyen218 If you want you can train original MixNet-M on ImageNet: #4203 (comment)
MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference
https://github.com/AlexeyAB/darknet/files/3838329/mixnet_m.cfg.txt
@AlexeyAB I just started looking into MixConvs. They seem very interesting! Do you know of anywhere that they are applied to object detection or are they only used in classification?
EfficientDet was published in November 2019, while MixConv was published in July 2019, so the EfficientDet authors clearly must have been aware of this type of convolution but, I'm thinking, neglected to use it for some reason.
The same authors wrote all three papers: MixNet, EfficientNet, EfficientDet.
- EfficientNet uses Grouped-Conv
- MixNet uses Grouped-Conv with different kernel_size
Neither EfficientNet nor MixNet is optimal for current CPUs/GPUs/neuro-chips (MyriadX, Coral TPU-Edge).
So they make such networks as reference networks to help create new neurochips (a new version of the TPU-Edge).
So maybe the reason they don't use MixNet for the detector is that creating a neurochip for EfficientNet (grouped conv) is much easier than for MixNet (grouped conv with different kernel_size).
Also, MixNet may have lower BFlops but also be slower.
@AlexeyAB Ah I see, that's an interesting approach. Yes, it seems that all of these new grouped-convolution techniques are quite slow in hardware, despite the lower parameter count.
Hi @AlexeyAB, I am trying to run inference with the MixNet model using your config and the pretrained weights mentioned at the start of the thread, but I am getting the error: "Error: in the file data/coco.names number of names 80 that isn't equal to classes=0 in the file cfg/mixnet_m_gpu.cfg". The number of classes is not mentioned in the config file, but this error says so. And even if it implies that it was trained on a different number of classes, it still does not make sense to have 0 classes in a config file. Am I missing something here? Can someone help me out?
I tried running it on Ubuntu 18.04 with the command: "./darknet detector test cfg/coco.data cfg/mixnet_m_gpu.cfg mixnet_m_gpu_final.weights -ext_output data/dog.jpg"