zhengyang-wang/Deeplab-v2--ResNet-101--Tensorflow

Image order RGB or BGR?

John1231983 opened this issue · 6 comments

Hello, I am using the official resnet-101 pre-trained model (as your link). It was trained on ImageNet, where the image channel order is RGB and the IMAGE_MEAN is

_R_MEAN = 123.68 / 255
_G_MEAN = 116.78 / 255
_B_MEAN = 103.94 / 255

The official resnet-101 pre-processing (L223) is:

def _mean_image_subtraction(image, means):
  num_channels = image.get_shape().as_list()[-1]
  channels = tf.split(axis=2, num_or_size_splits=num_channels, value=image)
  for i in range(num_channels):
    channels[i] -= means[i]
  return tf.concat(axis=2, values=channels)
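
For reference, a minimal usage sketch (assuming image is a float RGB tensor of shape [height, width, 3]; the means here are the undivided TF Slim values):

image = _mean_image_subtraction(image, [123.68, 116.78, 103.94])  # R, G, B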

Your code, however, converts RGB to BGR and uses a different IMAGE_MEAN. I think we should use the same pre-processing that the pre-trained model used, i.e. RGB order and the ImageNet image mean. Am I right?

For deeplab pre-trained models, I believe the order of image channels is BGR.
I provide the pre-trained resnet models from TF Slim, where the means should not be divided by 255.
https://github.com/tensorflow/models/blob/master/research/slim/preprocessing/vgg_preprocessing.py
And you are right about the order. If resnet pre-trained models are used, the order should be changed back to RGB. A one-line change is enough.
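
For example, a minimal sketch of that one-line change (assuming img is a [height, width, 3] tensor whose channels were reversed to BGR when the image was loaded):

img = tf.reverse(img, axis=[-1])  # flip BGR back to RGB along the channel axis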

But actually I don't think it is crucial. In this task, the size of the training patches is also different from that used for resnet, and the set of images is different. Maybe simply using image_mean=[127.5, 127.5, 127.5] will work well.
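
A hypothetical one-liner for that choice (the name IMG_MEAN is illustrative, not from this repo; with a uniform mean, the channel order no longer matters):

import numpy as np
IMG_MEAN = np.array([127.5, 127.5, 127.5], dtype=np.float32)  # uniform mean per channel
image = image - IMG_MEAN  # centers pixel values in [0, 255] around zero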

Agreed. I have tested with different IMAGE_MEAN values and there is not much performance difference. Could I ask one more question about the pre-trained model? When you use the resnet pre-trained model, it means that you copy the pre-trained resnet weights into the encoder part of the deeplab network, and then train the decoder part. But I found that you also train the encoder part after copying the weights from the resnet model. Why not train only the decoder part? Thanks

This is because the set of images changes. It is true that they are all natural images with similar features, so transfer learning is feasible here. However, images in PASCAL or CITYSCAPES do not appear in ImageNet. Thus, we'd like to fine-tune the encoder to let it fit the new set of images. Actually, we use the pre-trained models in order to make sure the training converges, as the number of images in PASCAL or CITYSCAPES is much smaller than that in ImageNet.
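
A minimal TF1-style sketch of that setup (the scope name 'resnet_v1_101' and the checkpoint path are illustrative assumptions, not taken from this repo):

import tensorflow as tf

# Restore only the encoder (ResNet) variables from the ImageNet checkpoint;
# decoder variables keep their fresh initialization.
encoder_vars = [v for v in tf.global_variables()
                if v.name.startswith('resnet_v1_101')]
restorer = tf.train.Saver(var_list=encoder_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    restorer.restore(sess, 'resnet_v1_101.ckpt')  # hypothetical checkpoint path
    # The training loop then updates both encoder and decoder variables,
    # i.e. the encoder is fine-tuned rather than frozen.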

I see. So it looks like we use the pre-trained model to get a good weight initialization, and then train the whole model starting from those weights. Am I right?