What is the reason for the low mIoU?
John1231983 opened this issue · 11 comments
Great project. Could you tell me why your reproduced result is lower than the paper? Thanks
Thanks for your interest.
I believe there are several potential reasons:
- There may be bugs in my code. I used the `slim` library for the base Resnet model, so I may be missing some hyper-parameter dependencies used there.
- I used Resnet v2 instead of Resnet v1, which the original paper used. This could affect the mIoU.
- For the end learning rate of the "poly" learning rate policy, `1e-8` is used, which I'm not sure is the right choice (see the sketch after this list).
- Most likely, my implementation is not memory efficient enough. It turns out that the batch size is important for improving mIoU, as is also shown in the paper. In the paper they are able to fit `batch_size = 16` for `output_stride = 16`. However, my implementation only fits `batch_size = 9` for `output_stride = 16` on a GTX 1080Ti. I suspect this is contributing to the lower mIoU of the repo.
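A minimal sketch of the "poly" policy mentioned above, assuming TensorFlow 1.x; the base rate and step count are placeholder values, and only the `1e-8` end learning rate comes from this thread:

```python
# Minimal sketch of the "poly" learning rate policy, assuming TF 1.x.
# base_learning_rate and training_steps are placeholders; only the
# 1e-8 end learning rate is taken from the discussion above.
import tensorflow as tf

base_learning_rate = 7e-3        # placeholder base rate
end_learning_rate = 1e-8         # the value questioned above
training_steps = 30000           # placeholder number of training steps
global_step = tf.train.get_or_create_global_step()

# poly: lr = (base - end) * (1 - step / max_steps) ** power + end
learning_rate = tf.train.polynomial_decay(
    base_learning_rate,
    global_step,
    training_steps,
    end_learning_rate=end_learning_rate,
    power=0.9)
```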
I would really appreciate it if you could shed any light on this. Thanks.
I also used deeplabv3 and it achieved 74% with a batch size of 8. I used pretrained resnet101 and a learning rate of 7e-3 for 30k steps, then 1e-4 for another 30k steps, with the same OS of 16. I guess the main problem is the learning schedule instead of the batch size, because the paper also reports the performance at batch size 8 and it is 75.78%. I am still investigating the problem but have not found it yet.
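For reference, a rough sketch of the two-stage schedule described above (7e-3 for the first 30k steps, then 1e-4 for the next 30k), assuming TF 1.x; this only illustrates the comment and is not code from either repo:

```python
# Rough sketch of the two-stage schedule described in the comment above:
# 7e-3 for the first 30k steps, then 1e-4 afterwards. Illustrative only.
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.piecewise_constant(
    global_step,
    boundaries=[30000],          # switch point between the two stages
    values=[7e-3, 1e-4])         # rate before / after the boundary
```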
Great point. I will also investigate several end learning rates with a batch size of 8 and let you know if I find an improvement. Thanks for your help.
Thanks for sharing your code! I'll also have a look at what could be causing the reduction in the Pascal VOC performance.
Hi @John1231983 , @JulienSiems , thanks for your help.
As John suggested, I investigated whether I could improve performance to the paper's mIoU at `batch_size = 8` (75.76%). Here's what I found so far:
- It turns out the `end_learning_rate` does not affect much if it is less than 1e-6, so I just keep 1e-6 as the default value.
- The aspp layer in the repo accidentally holds relu activation functions which are not mentioned in the paper. Interestingly, removing the relu activation functions degrades performance by 2 to 3%, so I decided to leave them in the repo (see the sketch after this list).
- I found that tuning the weight decay improves performance. With `weight_decay = 5e-4` and `batch_size = 8`, mIoU is 72.84%, but changing to `weight_decay = 2e-4` improves mIoU to 74.98%.
- Currently I am wondering if I should have included the postnorm layer of resnet_v2 in the deeplabv3 model, because the pre-activation variant does not have batch normalization or activation functions in the residual unit output. I'll let you know if this further improves performance.
- Also, apparently the newly released TensorFlow 1.6 deals with GPU memory better. With TF 1.6 I can fit `batch_size = 10` on a GTX 1080Ti, which could improve the final mIoU.
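A minimal sketch of the ASPP branches discussed above, assuming TF-slim; it only illustrates the two tuning points (relu kept on the ASPP convolutions and l2 weight decay), and the layer names and layout are illustrative rather than copied from the repo:

```python
# Sketch of atrous-spatial-pyramid-pooling branches with relu activations
# and l2 weight decay, assuming TF-slim. Names are illustrative; the
# image-pooling branch of DeepLabv3's ASPP is omitted for brevity.
import tensorflow as tf
import tensorflow.contrib.slim as slim

def aspp_sketch(features, weight_decay=2e-4, depth=256):
  with slim.arg_scope([slim.conv2d],
                      weights_regularizer=slim.l2_regularizer(weight_decay),
                      activation_fn=tf.nn.relu):  # the relu discussed above
    branch_1 = slim.conv2d(features, depth, [1, 1], scope='aspp_1x1')
    branch_2 = slim.conv2d(features, depth, [3, 3], rate=6, scope='aspp_r6')
    branch_3 = slim.conv2d(features, depth, [3, 3], rate=12, scope='aspp_r12')
    branch_4 = slim.conv2d(features, depth, [3, 3], rate=18, scope='aspp_r18')
    return tf.concat([branch_1, branch_2, branch_3, branch_4], axis=3)
```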
The fifth point is interesting. Although I have 2 TitanX cards, I cannot take full advantage of them because current TensorFlow does not support synchronized batch norm. As the paper mentioned, they train on a single K80 GPU (48GB), which allows training with a batch size of 16.
For weight decay, I used 1e-4 and it looks good. For relu, if the paper did not mention it, then the default is to use it after batch norm. The author used resnet-101, so I think it is better to follow his configuration.
Hi @John1231983 , thanks for the valuable comments and advice.
I tried the postnorm layer of resnet_v2 but the performance did not improve, so I kept it out of the network. In fact, I tried to use resnet_v1 as well, but since resnet_v1 uses more GPU memory than resnet_v2, I ended up using only resnet_v2.
Regarding the performance improvement, using TF 1.6 and a Tesla V100, I was able to fit `batch_size = 16` with `weight_decay = 1e-4`, and the mIoU improved to 76.42%, which is shared.
Also, following their newly published paper, Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (DeepLabv3+), I implemented the decoder in the model, which improved the mIoU to 77.31%. I share the model with the decoder in a new repo.
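A rough sketch of the DeepLabv3+ decoder described in that paper, assuming TF-slim; function and scope names are illustrative, not taken from the new repo, and the final 4x upsampling to input resolution is left out:

```python
# Sketch of the DeepLabv3+ decoder: project low-level features to 48
# channels, upsample the ASPP output to that resolution, concatenate,
# refine with 3x3 convs, then predict logits. Illustrative names only.
import tensorflow as tf
import tensorflow.contrib.slim as slim

def decoder_sketch(aspp_features, low_level_features, num_classes):
  # 1x1 conv to reduce the low-level features, as suggested in the paper
  low_level = slim.conv2d(low_level_features, 48, [1, 1],
                          scope='feature_projection')
  # bilinearly upsample the ASPP output to the low-level resolution
  size = tf.shape(low_level)[1:3]
  net = tf.image.resize_bilinear(aspp_features, size, align_corners=True)
  net = tf.concat([net, low_level], axis=3)
  # a couple of 3x3 convs to refine, then the per-class logits
  net = slim.conv2d(net, 256, [3, 3], scope='decoder_conv1')
  net = slim.conv2d(net, 256, [3, 3], scope='decoder_conv2')
  return slim.conv2d(net, num_classes, [1, 1], activation_fn=None,
                     scope='logits')
```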
Good job. I think the main problem in reproducing the method is how to use a larger batch size. I just use a TitanX with a maximum of 12GB, so I can only run a batch size of 8. I wonder how you ran on a Tesla V100. Did you use a cloud instance, or do you have it on your local machine?
Agreed. Reducing GPU memory usage is a big factor in improving mIoU. I only own a GTX 1080Ti locally, with which I can only fit a batch size of 11 at most. After I tuned the other hyper-parameters (with `batch_size = 8`) on it, I scaled up the batch size using an AWS p3 instance, which is equipped with a Tesla V100.
I see. I think I have to install TF 1.6 to use a larger batch size. I am using TF 1.4, which as I remember can only handle a batch size of 8 with OS = 16. I have checked the price of a p3 instance and it is too expensive (about 3.06 USD/hour), so I guess you spent more than 40 USD running it. Very exciting.
Yes. By upgrading from TF 1.5 to TF 1.6, the model could fit a larger batch in memory, and I found a log message saying "Finalizing graph", which I believe is the reason for the better memory management.