scut-aitcm/Competitive-Inner-Imaging-SENet

1x1 and 2x1 pair-view convolution?


As I understand it, you have a 1xC GAP vector (global average pooling) from the residual channels and a 1xC vector from the identity channels. Concatenating them gives a 2xC GAP vector. From this 2xC GAP vector, you want to combine/mix the two rows to obtain a 1xC attention vector. To combine them, you have two options: a 2x1 conv (kernel size (2,1)) and a 1x1 conv (kernel size (1,1)).

  1. In your experiments, the 1x1 convolution shows better results than the 2x1 convolution. Could you tell me the reason?
  2. The output of GAP will be batch_size x C. Then you reshape it to batch_size x 1 x C x 1 to apply conv2d. After concatenating with conv_x_concat = nd.concat(se_input_conv, se_input_skipx, dim=-1), conv_x_concat will have a size of batch_size x 1 x C x 2. Is that right? (See the quick shape check below.)

se_input_conv = se_input_conv.reshape(shape=(se_input_conv.shape[0], 1, se_input_conv.shape[1], 1))
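
For reference, a quick shape check of those two steps with toy sizes (a sketch, not the repository code; variable names follow the snippet above):

```python
import mxnet as mx

batch_size, C = 8, 64
# GAP outputs of the residual branch and the identity (skip) branch: (batch_size, C)
se_input_conv  = mx.nd.random.uniform(shape=(batch_size, C))
se_input_skipx = mx.nd.random.uniform(shape=(batch_size, C))

# reshape each to (batch_size, 1, C, 1) so conv2d can be applied
se_input_conv  = se_input_conv.reshape(shape=(se_input_conv.shape[0], 1, se_input_conv.shape[1], 1))
se_input_skipx = se_input_skipx.reshape(shape=(se_input_skipx.shape[0], 1, se_input_skipx.shape[1], 1))

# concat along the last (width) axis
conv_x_concat = mx.nd.concat(se_input_conv, se_input_skipx, dim=-1)
print(conv_x_concat.shape)  # (8, 1, 64, 2), i.e. batch_size x 1 x C x 2
```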

About the first question, we explained it in our paper; however, the paper is not available at the moment, for some reasons. We will release the original paper soon. (It looks like you have read it already.)
For the second question, I will invite the author of the code to respond.

The 2x1 Conv binds the signals at the same vertical position and models them together, so it uses fewer parameters.
However, the "Conv (2x1)" pair-view strategy models the competition between the residual and identity channels based on strict upper and lower positions, which ignores the fact that any feature signal in the re-imaged tensor could be associated with any other signal, not only with the one directly above or below it.

Thank you for your attention to our work.
For the second question, you are right. In this code, we convolve the map(batch_size,1,C,2) with 1x1 or 1x2 convolution kernels. And in the paper, we describes that we convolve the map(batch_size,1,2,C) with 1x1 or 2x1 convolution kernel. They are the same in theory and effect.

Due to the anonymity policy, we have temporarily stopped updating the code. We will publish the paper later and update the code to keep it consistent with the paper.

@superhy and @luonango: Thanks for your explanations. For the second question, given a feature map of size batch_size x 1 x C x 2 and a 1x1 convolution kernel, how do you obtain an output of batch_size x 1 x C x 1?
Here is the doc for mxnet's Convolution operator:

For general 2-D convolution, the shapes are

data: (batch_size, channel, height, width)
weight: (num_filter, channel, kernel[0], kernel[1])
bias: (num_filter,)
out: (batch_size, num_filter, out_height, out_width).
Define:

f(x,k,p,s,d) = floor((x+2*p-d*(k-1)-1)/s)+1
then we have:

out_height=f(height, kernel[0], pad[0], stride[0], dilate[0])
out_width=f(width, kernel[1], pad[1], stride[1], dilate[1])
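
Plugging this thread's shapes into that formula (a quick check, assuming pad 0, stride 1, dilate 1):

```python
from math import floor

def f(x, k, p=0, s=1, d=1):
    # out = floor((x + 2*p - d*(k-1) - 1) / s) + 1
    return floor((x + 2 * p - d * (k - 1) - 1) / s) + 1

C = 64
# data (batch_size, 1, C, 2) with a 1x1 kernel: spatial size is unchanged
print(f(C, 1), f(2, 1))  # 64 2  -> out is (batch_size, num_filter, C, 2)
# data (batch_size, 1, C, 2) with a 1x2 kernel: the width axis collapses to 1
print(f(C, 1), f(2, 2))  # 64 1  -> out is (batch_size, num_filter, C, 1)
```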

In my opinion, the map should be batch_size x 2 x C x 1 with a 1x1 convolution kernel; then the output would be batch_size x 1 x C x 1.

1x1:
data: (batch_size, 1, C, 2)
kernel: 1x1
kernel_num: C/16
out: (batch_size, C/16, C, 2)
mean_out: (batch_size, 1, C, 2)
flatten: (batch_size, 2C)

1x2:
data: (batch_size, 1, C, 2)
kernel: 1x2
kernel_num: C/16
out: (batch_size, C/16, C, 1)
mean_out: (batch_size, 1, C, 1)
flatten: (batch_size, C)
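
A toy sketch reproducing those two shape flows (not the repository code; it uses C/16 random filters and then averages over the filter axis, as listed above):

```python
import mxnet as mx

batch_size, C = 4, 64
r = C // 16
x = mx.nd.random.uniform(shape=(batch_size, 1, C, 2))  # re-imaged GAP pair

def pair_view(x, kernel):
    w = mx.nd.random.uniform(shape=(r, 1) + kernel)       # C/16 filters
    b = mx.nd.zeros((r,))
    out = mx.nd.Convolution(data=x, weight=w, bias=b, kernel=kernel, num_filter=r)
    out = mx.nd.mean(out, axis=1, keepdims=True)           # average the C/16 filter maps back to 1
    return out.flatten()

print(pair_view(x, (1, 1)).shape)  # (4, 128): flattened 2C, fed to an FC layer
print(pair_view(x, (1, 2)).shape)  # (4, 64):  flattened C
```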

I see. It looks like you enlarge the channels from 1 to C/16, then average them back into one map, then flatten to 2C and apply an FC layer to obtain the batch_size x C attention map in the 1x1 case. Am I right? Thanks.

Yes, that's it.

Just a final question before closing this: why not do something like the following (sketched below)?
batch_size x C concatenated with batch_size x C gives batch_size x 2C; then reshape to batch_size x 2C x 1 x 1. After that, apply a 1x1 convolution to reduce it to batch_size x C x 1 x 1 (similar to the feature reduction in bottleneck layers).
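
Concretely, I mean something like this (just a sketch with made-up sizes and names):

```python
import mxnet as mx

batch_size, C = 4, 64
gap_res  = mx.nd.random.uniform(shape=(batch_size, C))   # GAP of the residual branch
gap_skip = mx.nd.random.uniform(shape=(batch_size, C))   # GAP of the identity branch

# concat to (batch_size, 2C), then reshape to (batch_size, 2C, 1, 1)
z = mx.nd.concat(gap_res, gap_skip, dim=1).reshape(shape=(batch_size, 2 * C, 1, 1))

# C filters of size 1x1 reduce it to (batch_size, C, 1, 1), like a bottleneck reduction
w = mx.nd.random.uniform(shape=(C, 2 * C, 1, 1))
b = mx.nd.zeros((C,))
att = mx.nd.Convolution(data=z, weight=w, bias=b, kernel=(1, 1), num_filter=C)
print(att.shape)  # (4, 64, 1, 1)
```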

Oh, yes! Maybe that is also a good solution.
However, we want to reflect the idea of "Inner-Imaging" more directly and vividly and, moreover, prepare a smooth transition to the folded "3x3 Conv" mode for modeling the channel-wise relationships.
BTW, we are afraid that too much parameter reduction in the channel-wise attention module would lead to a performance loss. After all, the CMPE-SE module itself does not add many extra parameters or computations.

Thanks for the reply.

we are afraid that too much parameter reduction in the channel-wise attention module will lead to performance loss

So you enlarged it from 1 to C/16 and then averaged, instead of using the batch_size x 2C x 1 x 1 --> batch_size x C x 1 x 1 option.

Using C 1x1 convolution kernels to transform (batch_size, 2C, 1, 1) into (batch_size, C, 1, 1) is exactly the function of an FC layer. However, our method uses far fewer parameters.
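
That equivalence is easy to verify with shared weights (a toy check, not the repository code):

```python
import mxnet as mx

batch_size, C = 4, 64
z = mx.nd.random.uniform(shape=(batch_size, 2 * C))
w = mx.nd.random.uniform(shape=(C, 2 * C))
b = mx.nd.zeros((C,))

# FC layer: (batch_size, 2C) -> (batch_size, C), with 2C*C weights + C biases
fc_out = mx.nd.FullyConnected(data=z, weight=w, bias=b, num_hidden=C)

# the same weights used as C 1x1 conv kernels on (batch_size, 2C, 1, 1)
conv_out = mx.nd.Convolution(data=z.reshape(shape=(batch_size, 2 * C, 1, 1)),
                             weight=w.reshape(shape=(C, 2 * C, 1, 1)),
                             bias=b, kernel=(1, 1), num_filter=C)

# difference is ~0: the two operations compute the same thing
print(mx.nd.abs(fc_out - conv_out.reshape(shape=(batch_size, C))).max())
```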

Yes, it is an FC layer. Did you try it before and compare it with your approach? Many papers use this way (SENet, ...), so you could show the benefit of your approach over theirs. I guess the performance would be about the same.

Your understanding is somewhat correct.
However, based on the following reasons, we insist on using the approach described in our paper rather than the usual SE-Net one:

  1. Reducing the shape from batch_size x 2C x 1 x 1 to batch_size x C x 1 x 1 still costs C 1x1 filters.
  2. We do not want to disrupt the location relationships between channel signals by reshaping, and this is the key to making the "Inner-Imaging" mechanism work.
    e.g., with the 3x3 Conv mode of "Inner-Imaging", one "Inner-Imaging" filter can model the channel-wise relationships in every direction (up, down, left, right, upper-left, lower-left, upper-right, lower-right, ...) at the same time, which is not possible in a typical SE block (see the sketch below).
    In our experiments, we have verified the further improvement of "Inner-Imaging" and CMPE-SE over the original SE-Net.
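
For illustration only (the exact folding is defined in the paper, not here), a minimal sketch of what the folded 3x3 mode looks like: the 2C channel signals are arranged into a small 2-D "inner image" and a 3x3 filter relates each signal to its eight spatial neighbours in one convolution.

```python
import mxnet as mx

batch_size, C = 4, 64
# fold the 2C GAP signals into a rough rectangle (illustrative layout only)
h, w = 8, (2 * C) // 8
z = mx.nd.random.uniform(shape=(batch_size, 2 * C)).reshape(shape=(batch_size, 1, h, w))

# one 3x3 "Inner-Imaging" filter sees up/down/left/right and the four diagonals at once
wk = mx.nd.random.uniform(shape=(1, 1, 3, 3))
bk = mx.nd.zeros((1,))
out = mx.nd.Convolution(data=z, weight=wk, bias=bk, kernel=(3, 3), pad=(1, 1), num_filter=1)
print(out.shape)  # (4, 1, 8, 16)
```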