sitzikbs/netVLAD

Is it possible to replace the computation of the soft-assignment weights with a simple conv?


Hi,

First, thanks for sharing the code; I got a lot of inspiration from it. I am wondering whether it is possible to replace the way you compute the weights that are fed into the VLAD core with a simple tf.nn.conv2d(...), as in the original paper. As shown in the screenshot, the filters have size 1 × 1 × D × K (descriptor dimension × number of clusters). It would be nice if you could check whether my understanding is correct. Thanks in advance!
[screenshot from the NetVLAD paper showing the 1 × 1 × D × K soft-assignment filters]
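
For concreteness, here is a minimal sketch of what I have in mind (assuming TensorFlow 1.x; `features`, `K`, and `D` are placeholder names, not code from this repo):

```python
import tensorflow as tf

def soft_assignment(features, K, D):
    """Soft-assignment weights via a 1 x 1 convolution (sketch).

    features: [batch, W, H, D] local descriptors from the CNN.
    Returns:  [batch, W, H, K] soft-assignment weights.
    """
    # One 1 x 1 x D filter per cluster, as described in the paper.
    w = tf.get_variable('assignment_w', shape=[1, 1, D, K],
                        initializer=tf.contrib.layers.xavier_initializer())
    b = tf.get_variable('assignment_b', shape=[K],
                        initializer=tf.zeros_initializer())
    # logits[., x, y, k] = w_k^T x_i + b_k for the descriptor at (x, y).
    logits = tf.nn.conv2d(features, w, strides=[1, 1, 1, 1],
                          padding='VALID') + b
    # Softmax over the K clusters (last axis by default).
    return tf.nn.softmax(logits)
```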

I'm not sure I understand the question. What about the centers?

In my opinion, the centers are trainable, and they have the same dimension as the input descriptors. So we only need to initialize the centers as a [# of clusters, dim] variable using the Xavier method or anything else, since, as you also mentioned, initializing with k-means does not yield any performance gain. At each training step we then get updated centers, right?
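
Something along these lines, I mean (again a sketch with placeholder names and values, assuming TensorFlow 1.x):

```python
import tensorflow as tf

K, D = 64, 512  # placeholder values: number of clusters, descriptor dimension

# Trainable cluster centers, one D-dimensional center per cluster.
centers = tf.get_variable('cluster_centers', shape=[K, D],
                          initializer=tf.contrib.layers.xavier_initializer())
# The optimizer then updates `centers` at every training step,
# together with the rest of the network weights.
```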

And my question is whether we can replace this part of the code with a simple convolution. As shown in the original paper, they use 1 × 1 × D × K filters.
[screenshots from the NetVLAD paper describing the 1 × 1 × D × K convolutional soft-assignment]

I am a bit rusty on the details (it is not my paper, I just implemented it), but your idea seems OK. Notice, however, that the weights are only shared across k, so I think you can try to replace the code. Let me know if it comes out the same.
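
For what it is worth, a 1 × 1 conv computes the same linear map as a per-pixel matmul, so "comes out the same" is easy to check numerically (a sketch with placeholder shapes, assuming the TF 1.x session API):

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 4, 512).astype(np.float32)  # [batch, W, H, D]
w = np.random.rand(512, 64).astype(np.float32)       # [D, K]

# 1 x 1 x D x K convolution.
conv_out = tf.nn.conv2d(tf.constant(x),
                        tf.constant(w.reshape(1, 1, 512, 64)),
                        strides=[1, 1, 1, 1], padding='VALID')
# The same linear map applied to every pixel independently via matmul.
mm_out = tf.reshape(tf.matmul(tf.reshape(tf.constant(x), [-1, 512]),
                              tf.constant(w)), [1, 4, 4, 64])

with tf.Session() as sess:
    a, b = sess.run([conv_out, mm_out])
print(np.allclose(a, b, atol=1e-4))  # expect True
```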

Hi,

I am also working on netVLAD for place recognition.

In my opinion, the VLAD layer is not just another conv2d layer, so you cannot replace it with tf.nn.conv2d().
The reason is that the VLAD layer accumulates the residuals of all pixels globally, according to Equation (4) in the original paper: https://arxiv.org/pdf/1511.07247.pdf
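
For reference, Equation (4) of the paper reads:

```latex
V(j, k) = \sum_{i=1}^{N}
  \frac{e^{\mathbf{w}_k^{\top} \mathbf{x}_i + b_k}}
       {\sum_{k'} e^{\mathbf{w}_{k'}^{\top} \mathbf{x}_i + b_{k'}}}
  \left( x_i(j) - c_k(j) \right)
```

The softmax factor is exactly what a 1 × 1 conv plus softmax can produce, but the sum over all N descriptors is a global pooling step, not a convolution.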

For the VLAD layer, each pixel of the output (size W × H × 64) contains information from all pixels of the input (size W × H × 512). If you used a conv2d layer instead, each output pixel would only contain information from local pixels of the input.
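
In code, the aggregation step looks roughly like the following sketch (placeholder names, assuming TensorFlow 1.x and the shapes above):

```python
import tensorflow as tf

def vlad_pooling(features, assignment, centers):
    """Aggregation of Eq. (4) (sketch).

    features:   [batch, W, H, D] local descriptors.
    assignment: [batch, W, H, K] soft-assignment weights.
    centers:    [K, D] trainable cluster centers.
    Returns:    [batch, K, D] VLAD descriptor before normalization.
    """
    # Residuals x_i - c_k for every pixel i and cluster k (broadcast).
    residuals = tf.expand_dims(features, 3) - centers      # [batch, W, H, K, D]
    weighted = tf.expand_dims(assignment, -1) * residuals  # [batch, W, H, K, D]
    # Global sum over ALL W x H locations: every output entry depends
    # on every input pixel, unlike the local receptive field of a conv2d.
    return tf.reduce_sum(weighted, axis=[1, 2])
```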

Closing due to lack of activity.