Stride and Dilation?
Max-Fu opened this issue · 6 comments
In the paper, it says
In order to alleviate this issue, we follow [7] and modify the ResNet-50 architecture [13] by reducing the stride of the network and introducing dilation factors.
However, neither the "dilation factors" nor the "stride" is explicitly stated in the paper (I failed to derive either of these from the feature maps' shapes provided on page 3 of the paper). Looking into reference [7], DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs (DeepLab v2), the ResNet it describes is ResNet-101, not ResNet-50. Is it possible for you to release the stride and dilation (tf.nn.atrous_conv2d) factors you used in the experiment?
Thanks!
Max
Hi Max-Fu,
In our paper, we refer to the "stride of the network" as the ratio of the input image's spatial resolution to the final output resolution of ResNet-50. All our experiments were conducted using output_stride=8.
For example, in the classical ResNet-50 the final feature responses (before the fully connected layer or global pooling) are 32 times smaller than the input image dimensions, so it has an output stride of 32.
If one would like to quadruple the spatial density of the computed feature responses (output stride = 8, as in our Polygon-RNN++), the stride of the last pooling or convolutional layers that decrease resolution (in this case, those producing maps of size <= 28) is set to 1 to avoid signal decimation. All subsequent convolutional layers are then replaced with atrous convolutional layers having rates r = 2 and r = 4. Concretely, we use dilation rate r = 2 in block3 and r = 4 in block4.
Here you can see a good example of how this is usually done:
https://github.com/tensorflow/models/blob/master/research/slim/nets/resnet_utils.py#L191-L202
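For reference, here is a minimal PyTorch sketch of the same idea (not the authors' TensorFlow code): torchvision's resnet50 can trade stride for dilation in the last two blocks, giving output stride 8 with dilation rates 2 and 4 in layer3 and layer4.

```python
import torch
from torchvision.models import resnet50

# replace_stride_with_dilation=[False, True, True] keeps layer2 at stride 2,
# but sets stride=1 in layer3/layer4 and dilates them with r=2 and r=4,
# yielding the output-stride-8 variant described above.
model = resnet50(replace_stride_with_dilation=[False, True, True])

# Drop the global-pooling and fully connected head to expose the feature map.
features = torch.nn.Sequential(*list(model.children())[:-2])

x = torch.randn(1, 3, 224, 224)
out = features(x)
print(out.shape)  # torch.Size([1, 2048, 28, 28]) -> 224 / 28 = output stride 8
```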
Hope this helps,
David
Hello David,
Thank you for your reply. Currently, I am trying to implement and understand your research (to learn about atrous convolution and attention-weighted features). Thanks to your explanation, I was able to recreate the ResNet structure. For concatenating the feature maps (with the shape of 112x112x256), were the feature map outputs first concatenated (after bilinear upsampling) and then convolved with 256 filters? (Currently this yields 28x28x2880 after concatenating the feature maps.)
Another question is regarding the feature map F_t (in the attention-weighted features). I am quite new to ConvLSTMs and attention-weighted features, so please correct me if I am wrong. The input to the convolutional LSTM (at each time step), if I understood correctly, has the same shape as the skip features (28x28x128). Concatenating F_t with the outputs at time steps y_{t-1} and y_{t-2} would yield a feature map of shape 28x28x130 (128+1+1). So do I have to perform another convolution with 128 filters to obtain the same shape?
Thank you again for your explanation!
Best,
Max
Hi Max,
For the ResNet, each feature map (for example 112x112x256, 28x28x2048, etc.) was first convolved with 64 filters (denoted by the dotted lines in the diagram), and the 4 outputs with 64 channels each were concatenated to get 256 channels.
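As a rough illustration, the fusion might look like the sketch below. This is not the authors' code; the input channel counts (64, 256, 512, 2048) are inferred from the 2880 figure in the thread (64+256+512+2048 = 2880), and the common output resolution is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusion(nn.Module):
    """Reduce each backbone map to 64 channels, upsample, and concatenate."""

    def __init__(self, in_channels=(64, 256, 512, 2048), out_size=112):
        super().__init__()
        self.out_size = out_size
        # One 3x3 conv per ResNet feature map, reducing each to 64 channels
        # (the dotted lines in the paper's diagram).
        self.reducers = nn.ModuleList(
            [nn.Conv2d(c, 64, kernel_size=3, padding=1) for c in in_channels]
        )

    def forward(self, feature_maps):
        reduced = []
        for conv, fmap in zip(self.reducers, feature_maps):
            x = conv(fmap)
            # Bilinearly upsample every map to a common spatial size.
            x = F.interpolate(x, size=(self.out_size, self.out_size),
                              mode="bilinear", align_corners=False)
            reduced.append(x)
        # Concatenate the four 64-channel maps -> 256 channels.
        return torch.cat(reduced, dim=1)
```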
For the convLSTM, the input size is 28x28x131, since the first vertex is also provided as input at each time step. There is no extra convolution to convert it to 28x28x128; 28x28x131 is the final input to the first layer of the convLSTM.
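A minimal sketch of assembling that 28x28x131 input, per the reply above (variable names are illustrative, not from the released code):

```python
import torch

# Skip features plus three 1-channel vertex grids -> 131 input channels.
skip_feats = torch.randn(1, 128, 28, 28)  # F_t from the encoder
y_prev1 = torch.zeros(1, 1, 28, 28)       # vertex at step t-1 (one-hot grid)
y_prev2 = torch.zeros(1, 1, 28, 28)       # vertex at step t-2
y_first = torch.zeros(1, 1, 28, 28)       # first vertex, fed at every step

x_t = torch.cat([skip_feats, y_prev1, y_prev2, y_first], dim=1)
print(x_t.shape)  # torch.Size([1, 131, 28, 28])
```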
Hi amlankar,
Thank you for your response. Now I understand it.
Best,
Max
Hi @Max-Fu,
We have released the training code now! https://github.com/fidler-lab/polyrnn-pp-pytorch
Thanks!