AlexeyAB/darknet

MixNet (Mix_Conv) - 0.360 (0.5) BFlops - 77.0% (71.5%) Top1

CuongNguyen218 opened this issue · 51 comments

Hi @AlexeyAB ,
Mix_conv: Mixed Depthwise Convolutional Kernels.
Arxiv
Github
Top1 Acc: 78.9% on ImageNet with 0.56 BFlops. I think this idea is good.

MixNet-L and -M have the same network architecture: we simply apply depth_multiplier 1.3 on MixNet-M to get MixNet-L, as shown in this code: https://github.com/tensorflow/tpu/blob/56e1058cba2b7b5ca233a4c9bfd7331a69082188/models/official/mnasnet/mixnet/mixnet_builder.py#L217
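For reference, here is a rough sketch of how such a depth multiplier scales per-layer filter counts. The multiple-of-8 rounding rule below is my reading of the round_filters() helper in the linked mixnet_builder.py, not code copied from it, and the example filter counts are illustrative only:

# Hypothetical sketch of depth-multiplier scaling (e.g. MixNet-M widths -> MixNet-L widths).
# The multiple-of-8 rounding mirrors round_filters() in the linked mixnet_builder.py (assumption).
def round_filters(filters, multiplier=1.3, divisor=8):
    """Scale a filter count by `multiplier` and round to the nearest multiple of `divisor`."""
    filters *= multiplier
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # never shrink a layer by more than 10% through rounding
        new_filters += divisor
    return int(new_filters)

for f in (16, 24, 40, 80, 120, 200):  # example filter counts (illustrative only)
    print(f, "->", round_filters(f))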

Trained models:



  • GhostNet-1.0 - 5.0M params - 0.119 BFlops - xx.x% Top1 - xx.x% Top5 - MY URL
  • GhostNet-1.0 - 5.2M params - 0.141 BFlops - 73.9% Top1 - 91.4% Top5 - Official
  • MobileNetV3 - 5.4M params - 0.219 BFlops - 75.2% Top1 - --- Top5
  • GhostNet-1.3 - 7.3M params - 0.226 BFlops - 75.7% Top1 - 92.7% Top5 - Official
  • MixNet-S - 4.1M params - 0.256 BFlops - 75.8% Top1 - 92.8% Top5
  • MixNet-M - 5.0M params - 0.360 BFlops - 77.0% (71.5%) Top1 - 93.3% (90.5%) Top5 - #4203
  • MixNet-L - 7.3M params - 0.565 BFlops - 78.9% Top1 - 94.2% Top5
  • EfficientNetB0 - 4.9M params - 0.450 BFlops - 76.3% (71.3%) Top1 - 93.2% (90.4%) Top5 - MY URL
  • EfficientNetB0 - 5.3M params - 0.390 BFlops - 76.3% (70.0%) Top1 - 93.2% (88.9%) Top5 - Official
  • EfficientNetB1 - 7.8M params - 0.700 BFlops - 78.8% Top1 - 94.4% Top5 #3380
  • ShuffleNetV2 - xxxx params - 0.600 BFlops - 75.4% Top1 - xxxx Top5 #3750
  • Darknet53 - 20.0M params - 18.5 BFlops - 77.2% Top1 - 93.8% Top5
    https://github.com/pjreddie/darknet/blob/master/cfg/darknet53.cfg and https://pjreddie.com/darknet/imagenet/

Explanation:

  • MixNet-M-GPU is a slightly optimized version of MixNet-M for GPU: it has higher BFlops, but it is also faster on GPU

  • MixNet-M achieves 77.0% Top1 and EfficientNetB0 achieves 76.3% Top1 only when they are trained with a large mini_batch_size on a large cluster (DGX-2 ~400k$ or a GPU/TPU cluster ~1M$); otherwise the official EfficientNetB0 achieves only 70.0% Top1, which is lower than our EfficientNetB0's 71.3% Top1: https://github.com/WongKinYiu/CrossStagePartialNetworks#small-models (for example, GhostNet-1.0 should be trained with batch-norm synchronization on 8 GPUs with mini_batch_size=1024).
    To achieve 77.0% Top1 on MixNet-M, use Darknet GPU-processing on CPU-RAM: #4386

  • MixNet-M-GPU has 0.532 BFlops, while Darknet shows 1.065 BFlops, i.e. 2x more. In all papers BFlops is actually FMA_BFlops (1 FMA = 2 operations: MUL + ADD): https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation (see the sketch after this list)

  • Why do these models have a low number of BFLOPS but also low speed? In these models the low BFLOPS count is achieved by using grouped/depthwise convolutions, which are very slow on GPU, TPU-Edge and other devices.
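A small sketch of how these FLOP numbers relate (my own helper, not Darknet's counter): the multiply-accumulate count of a (grouped) convolution, reported either as FMA ops (as in the papers) or as separate MUL + ADD ops (as Darknet prints). The shapes are illustrative only:

# Rough FLOP counter for a k x k (grouped) 2D convolution (illustrative, not Darknet code).
def conv_bflops(h, w, c_in, c_out, k, groups=1, fma=True):
    """Approximate BFLOPS of a k*k convolution producing an h*w*c_out output."""
    macs = h * w * c_out * (c_in // groups) * k * k  # multiply-accumulate operations
    return (macs if fma else 2 * macs) / 1e9         # papers count 1 FMA = 1 op; Darknet counts MUL + ADD

# Illustrative 3x3 conv on a 56x56x64 tensor:
print("standard conv :", conv_bflops(56, 56, 64, 64, 3))             # dense across channels
print("depthwise conv:", conv_bflops(56, 56, 64, 64, 3, groups=64))  # 64x fewer MACs, but poor GPU utilization
print("darknet-style :", conv_bflops(56, 56, 64, 64, 3, fma=False))  # ~2x the paper-style number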


We replace one of the 15 layers with either (1) a vanilla DepthwiseConv9x9 with kernel size 9x9; or (2) MixConv3579 with 4 groups of kernels: {3x3, 5x5, 7x7, 9x9}.
As shown in the figure, a large kernel size has a different impact on different layers: for most layers the accuracy doesn't change much, but for certain layers with stride 2 a larger kernel can significantly improve the accuracy. Notably, although MixConv3579 uses only half the parameters and FLOPS of the vanilla DepthwiseConv9x9, our MixConv achieves similar or slightly better performance for most of the layers.

Depthwise convolution is becoming increasingly popular in modern efficient ConvNets, but its kernel size is often overlooked. In this paper, we systematically study the impact of different kernel sizes, and observe that combining the benefits of multiple kernel sizes can lead to better accuracy and efficiency.
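To make the MixConv idea concrete, here is a small conceptual sketch (not the authors' implementation) of how the input channels could be partitioned into per-kernel-size depthwise groups:

# Conceptual MixConv channel partitioning (sketch; kernel sizes {3,5,7,9} as in MixConv3579).
def mixconv_partition(c_in, kernel_sizes=(3, 5, 7, 9)):
    """Return (kernel_size, channel_range) pairs: each slice gets its own depthwise kernel size."""
    n = len(kernel_sizes)
    base, rem = divmod(c_in, n)
    splits, start = [], 0
    for i, k in enumerate(kernel_sizes):
        size = base + (1 if i < rem else 0)  # spread any remainder over the first groups
        splits.append((k, range(start, start + size)))
        start += size
    return splits

for k, chans in mixconv_partition(16):
    print(f"{k}x{k} depthwise conv on channels {chans.start}..{chans.stop - 1}")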

[figure: mixnet-flops]


For comparison with EfficientNet:

[comparison figures]

@AlexeyAB
As I understand it, the input tensor is split by the number of filters in Mix_Conv. As I see in the cfg above, I think you assume that the input has 16 channels and split it by 4, getting 4 tensors with 4 input channels each, right? But I can't understand why you used route layers with -2, -4, -6. Can you ensure that the input of each conv layer follows the order [0:3] for 3x3, [4:8] for 5x5, and so on?

@CuongNguyen218 thanks for sharing this.

And yeah, it seems like AlexeyAB's cfg will apply the filters to the entire input tensor (like InceptionNet).

Maybe a slice implementation should be used rather than split, @AlexeyAB

@beHappy666

  • Original MixNet uses 4 depthwise conv-layers (3x3, 5x5, 7x7, 9x9) instead of 1 depthwise conv-layer

  • Yes, we can try to implement a channel_slice layer as in ShuffleNetV2: #3750

  • Or it may be easier to improve the [route] layer:

[route]
layers = -1
group_id=0
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=3
stride=2
pad=1
activation=leaky

[route]
layers = -3
group_id=1
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=5
stride=2
pad=1
activation=leaky

[route]
layers = -5
group_id=2
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=7
stride=2
pad=1
activation=leaky

[route]
layers = -7
group_id=3
groups=4

[convolutional]
batch_normalize=1
filters=4
groups=4
size=9
stride=2
pad=1
activation=leaky

[route]
layers = -1,-3,-5,-7

I added groups= and group_id= params to the [route] layer, so you can try to implement MixNet by using such blocks: #4203 (comment)

But I didn't test it.

Commit: 0fa9c8f
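For what it's worth, a minimal sketch of how I read the new groups=/group_id= options on [route] (my understanding of the intended slicing, not the actual Darknet source): the layer forwards an even 1/groups slice of the source layer's channels, selected by group_id, so with a 16-channel input the four branches above would see channels 0-3, 4-7, 8-11 and 12-15:

# Sketch of the channel slice a [route] layer with groups=/group_id= forwards (assumption, not Darknet code).
def route_group_slice(c_in, groups, group_id):
    """Channel range passed through by [route] for the given groups/group_id."""
    assert c_in % groups == 0, "input channels must divide evenly into groups"
    step = c_in // groups
    return range(group_id * step, (group_id + 1) * step)

for gid, k in zip(range(4), (3, 5, 7, 9)):  # the four MixConv branches in the cfg above
    s = route_group_slice(16, 4, gid)
    print(f"group_id={gid} -> channels {s.start}..{s.stop - 1} -> {k}x{k} conv")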

@AlexeyAB, how can I verify that it works correctly?

@AlexeyAB
Since it uses depthwise convolutions, it is better suited to the CPU.
This must be converted to OpenVINO. We have to think about operator fusion.

@dexception

We have to think about operator fusion.

What is operator fusion?

@CuongNguyen218 @dexception @beHappy666 @gnefihs @WongKinYiu @LukeAI

I implemented the MixNet-M classification network, so you can try to train it on ImageNet.
It seems it can be fast only on CPU.

GPU nVidia RTX 2070

  • MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference

  • MixNet-M-XNOR (partially BIT-1 inference): mixnet_m_xnor.cfg.txt - 0.237 BFlops (0.118 FMA) - 5.3 sec per iteration training - 45ms inference (32 BIT-1 ops = 1 Flops; see the rough accounting after this list)

  • MixNet-M-GPU (minor modification for GPU): mixnet_m_gpu.cfg.txt - 1.0 BFlops (0.500 FMA) - 2.7 sec per iteration training - 45 ms inference
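Rough accounting behind the MixNet-M-XNOR numbers above (my own back-of-envelope, only following the "32 BIT-1 ops = 1 Flops" convention; the float/binary split used below is illustrative, not measured):

# Back-of-envelope effective-FLOP count for a partially binarized (XNOR) network.
# Convention from above: 32 BIT-1 ops are counted as 1 FLOP. The 70/30 split is illustrative only.
FMA_FULL = 0.379   # FMA BFLOPS of the full-precision MixNet-M (from the list above)
BIT_FACTOR = 32    # 32 binary ops counted as 1 FLOP

def effective_fma(binarized_fraction):
    """Effective FMA BFLOPS if a fraction of the MACs runs as BIT-1 XNOR ops."""
    return FMA_FULL * ((1.0 - binarized_fraction) + binarized_fraction / BIT_FACTOR)

print(round(effective_fma(0.70), 3), "FMA BFLOPS")  # ballpark of the 0.118 figure above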

@AlexeyAB Hello,

#4203 (comment)

  • MixNet-S - 4.1M params - 0.256 BFlops - 75.8% Top1 - 92.8% Top5
  • MixNet-M - 5.0M params - 0.360 BFlops - 77.0% Top1 - 93.3% Top5

#4203 (comment)

  • MixNet-M - 0.256 BFlops - 4.6 sec per iteration training - 45ms inference
  • MixNet-M-GPU (minor modification for GPU) - 1.0 BFlops - 2.7 sec per iteration training - 45 ms inference

I'd like to know what the difference is between these two comments, thanks.

@WongKinYiu

I'd like to know what the difference is between these two comments, thanks.

The 1st is taken from the paper.
The 2nd is the actual implementation.

Or what do you mean?

MixNet is just a more efficient (Top1/FLOPS) modification of EfficientNet.

Just to make sure I understand correctly:

the implemented MixNet-M is 0.256 BFLOPs, but the GPU version is 1.0 BFLOPs,
and the BFLOPs of the implemented MixNet-M is the same as MixNet-S in the paper.

I'll take a look at the cfg files after I finish my breakfast, thank you.

Yes, I just made some changes in MixNet-M (mixnet_m_gpu.cfg.txt) so it can be trained ~2x faster - 2.7 sec instead of 4.6 sec per training iteration with the same inference speed on GPU.
I just decreased groups= in depthwise-MixConv-layers, so it should be more accurate and faster on GPU.

Maybe we should look at Diagonalwise Refactorization (15x speedup of depthwise convolutions) to speed up EfficientNet and MixNet: #3908

Now training mixnet_m.cfg.txt - 0.256 BFlops - 4.6 sec per iteration training - 45ms inference.
But it shows: Total BFLOPS 0.759.


Update: I get cuDNN Error: CUDNN_STATUS_INTERNAL_ERROR

@WongKinYiu Yes, I fixed it. BFLOPS 0.759 corresponds to 0.379 FMA BFLOPS (the EfficientNet and MixNet authors count FMA).

I successfully trained mixnet_m_gpu.cfg.txt for 10 000 iterations on Windows 7 x64.

@AlexeyAB thanks,

I do not know why, but on every one of my Windows computers, training models with grouped convolutions crashes.
On Ubuntu, everything works.

@WongKinYiu

  • How many iterations did you train before this error occurred?
  • Can you show screenshot of this error?
  • Try to increase subdivisions.
  • What CUDA and cuDNN versions do you use?
  • Show output of
nvcc --version
nvidia-smi

@AlexeyAB

100~900 iterations.
CUDA 10.
[screenshot of the cuDNN error]

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018
Cuda compilation tools, release 10.0, V10.0.130

Windows does not have nvidia-smi.

@WongKinYiu

This is a very strange error: why is it trying to create another cuDNN handle when one has already been created?

Windows does not have nvidia-smi.

It should be in the C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi
nvidia-smi.zip

Do you use the latest version of Darknet?
If you set subdivisions=8 does it help?

Yes, I use the latest version.


@WongKinYiu

  • your nvcc --version shows CUDA 10.0, while nvidia-smi shows CUDA 10.1; maybe this is the reason.
  • also, some users have encountered errors when using CUDA 10.1

Yes, I noticed that nvidia-smi shows CUDA version 10.1.
It is really strange.
When I installed CUDA, CUDA 10.1 had not been released yet.

Or just try to use a newer cuDNN version.

Yes, I noticed that nvidia-smi shows CUDA version 10.1.
It is really strange.
When I installed CUDA, CUDA 10.1 had not been released yet.

nvidia-smi reports the maximum CUDA version compatible with the driver, not the CUDA version installed locally. Yes, it's confusing.

@LukeAI thanks.

@AlexeyAB Hello,

  • MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference
  • MixNet-M-XNOR (partially BIT-1 inference): mixnet_m_xnor.cfg.txt - 0.237 BFlops (0.118 FMA) - 5.3 sec per iteration training - 45ms inference (32 BIT-1 ops = 1 Flops)
  • MixNet-M-GPU (minor modification for GPU): mixnet_m_gpu.cfg.txt - 1.0 BFlops (0.500 FMA) - 2.7 sec per iteration training - 45 ms inference

Training these three models currently takes too much time...

  • MixNet-M: epoch ~70,000, loss ~3.9.
  • MixNet-M-XNOR: epoch ~50,000, loss ~5.5.
  • MixNet-M-GPU: epoch ~100,000, loss ~3.6.

I will continue training MixNet-M-GPU and stop training the other two models.

@WongKinYiu

What Top1/Top5 accuracy can you get for these 3 models currently?

@AlexeyAB

  • MixNet-M: epoch ~70,000, loss ~3.9, top-1 = 21.96%, top-5 = 45.85%.
  • MixNet-M-XNOR: epoch ~50,000, loss ~5.5, top-1 = 0.1%, top-5 = 0.5%.
  • MixNet-M-GPU: epoch ~100,000, loss ~3.6, top-1 = 24.49%, top-5 = 49.14%.

@WongKinYiu Thanks. It seems that xnor=1 isn't suitable for the places where I added it in the MixNet-M-XNOR model.

@WongKinYiu Hi,

I will continue training MixNet-M-GPU and stop training another two models.

MixNet-M-GPU: epoch ~100,000, loss ~3.6, top-1 = 24.49%, top-5 = 49.14%.

what result did you get?

@AlexeyAB Hello,
Currently 320k epochs, loss ~= 2.5.
Currently 413k epochs, loss ~= 2.0, top-1 ~= 53.9%, top-5 ~= 79.1%.
Currently 480k epochs, loss ~= 1.7, top-1 ~= 60.4%, top-5 ~= 84.3%.
Currently 546k epochs, loss ~= 1.4, top-1 ~= 65.7%, top-5 ~= 87.1%.
Currently 618k epochs, loss ~= 1.2, top-1 ~= 69.4%, top-5 ~= 89.3%.
Currently 681k epochs, loss ~= 1.0, top-1 ~= 71.0%, top-5 ~= 90.3%.
Currently 736k epochs, loss ~= 1.0, top-1 ~= 71.4%, top-5 ~= 90.5%.
Finished: top-1 ~= 71.5%, top-5 ~= 90.5%.

@WongKinYiu After MixNet-M-GPU has been trained on ImageNet, I will try to implement EfficientDet - BiFPN (trainable fusion layer, shared weights in the last 2-3 conv-layers before the yolo-layer, ...) with a MixNet-M-GPU backbone: #4346
With a [Gaussian_yolo] layer + CIoU loss.
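For reference, the CIoU loss mentioned here (as I recall it from the DIoU/CIoU paper; please check the original for the exact formulation) augments the IoU loss with a center-distance term and an aspect-ratio consistency term:

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(\mathbf{b},\mathbf{b}^{gt})}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\Big(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\Big)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$

where ρ is the distance between the predicted and ground-truth box centers and c is the diagonal of the smallest box enclosing both boxes; DIoU is the same loss without the αv term.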

@AlexeyAB Thanks,

I will train with DIoU and CIoU.
I sent you an invitation to my private repo, so you can see the updated results there.

@WongKinYiu,
can you give me a link to the CIoU and DIoU papers?

@AlexeyAB ,
Did you provide an EfficientNet model, or did you convert an EfficientNet model pretrained on ImageNet to Darknet?

@CuongNguyen218

ImageNet and COCO models of EfficientNet-B0: #3874 (comment)

@AlexeyAB , what result did you get?

@AlexeyAB

mixnet-m-gpu, top-1 = 71.5%, top-5 = 90.5%.

@WongKinYiu Nice! Can you share weights-file?

Why are your results so different from the paper?

Because mixnet-m-gpu was designed by @AlexeyAB; it does not appear in the paper.

@CuongNguyen218

Why are your results so different from the paper?

Because in the paper, MixNet and EfficientNet are trained with a very large mini_batch_size on a DGX-2 / cluster (~400k$ - 1M$).
You can achieve the same 77.0% Top1 accuracy by using Darknet with #4386

If we train with the same mini_batch_size, then EfficientNet-B0 (official) has even lower Top1/5 accuracy than my EfficientNet-B0: https://github.com/WongKinYiu/CrossStagePartialNetworks#small-models

Also, I slightly optimized MixNet on GPU so that it can be trained in 1 month instead of 2 months.

@CuongNguyen218 If you want you can train original MixNet-M on ImageNet: #4203 (comment)

MixNet-M: mixnet_m.cfg.txt - 0.759 BFlops (0.379 FMA) - 4.6 sec per iteration training - 45ms inference

https://github.com/AlexeyAB/darknet/files/3838329/mixnet_m.cfg.txt

@AlexeyAB I just started looking into MixConvs. They seem very interesting! Do you know of anywhere that they are applied to object detection, or are they only used in classification?

EfficientDet was published in November 2019, while MixConv was published in July 2019, so the EfficientDet authors must have been aware of this type of convolution but, I'm thinking, neglected to use it for some reason.

@glenn-jocher

The same authors wrote all three papers: MixNet, EfficientNet, EfficientDet.

  • EfficientNet uses Grouped-Conv
  • MixNet uses Grouped-Conv with different kernel_size

Neither EfficientNet nor MixNet is optimal for current CPUs/GPUs/neuro-chips (MyriadX, Coral TPU-Edge).

So they build such networks as reference networks to help create new neuro-chips (a new version of TPU-Edge).

So maybe that is the reason they don't use MixNet for the detector: creating a neuro-chip for EfficientNet (grouped conv) is much easier than for MixNet (grouped conv with different kernel sizes).

Also, MixNet may have lower BFlops but still be slower.

@AlexeyAB Ah I see, that's an interesting approach. Yes, it seems that all of these new grouped-convolution techniques run quite slowly on hardware, despite the lower parameter count.

Hi @AlexeyAB, I am trying to run inference with the MixNet model using your config and the pretrained weights mentioned at the start of the thread, but I am getting the error: "Error: in the file data/coco.names number of names 80 that isn't equal to classes=0 in the file cfg/mixnet_m_gpu.cfg". The number of classes is not mentioned in the config file, but this error says it is. And even if it implies that the model was trained on a different number of classes, it still does not make sense to have 0 classes in a config file. Am I missing something here? Can someone help me out?

I tried running it on Ubuntu 18.04 using the command: "./darknet detector test cfg/coco.data cfg/mixnet_m_gpu.cfg mixnet_m_gpu_final.weights -ext_output data/dog.jpg"