Relja/netvlad

About NetVLAD

Closed this issue · 3 comments

MrHwc commented

When the input is (N, W, H, D), the output of NetVLAD is (N, D*K), and the output of GhostVLAD is also (N, D*K). Is that correct? The GhostVLAD paper says it can "take any number of images as input, and output a fixed-length template descriptor to represent the input image set", so it should aggregate N vectors into 1 vector, i.e. the input is (N, W, H, D) and the output is (1, D*K). How should I understand this sentence?

Relja commented

VLAD, NetVLAD and GhostVLAD all take N D-dimensional vectors and aggregate them into a single K*D-dimensional vector. The question is what the N input vectors are.

In this code, for one image we take the WxH neural network activations, each of which is D-dimensional, and use N = WxH. The confusion is that we do it in a batched fashion, i.e. BxWxHxD gets converted to Bx(K*D).

In the GhostVLAD paper, the point is that the aggregated vectors each come from a different image. So one image is represented as a D-dimensional vector, N images (an image set) give NxD vectors, and these are aggregated into a single K*D vector. So N images produce a single K*D descriptor.
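To make the shapes concrete, here is a minimal NumPy sketch of the core aggregation step shared by all three methods: N D-dimensional vectors go in, one K*D vector comes out. This is a hypothetical stand-in for the actual layer (the soft-assignment and normalization follow the NetVLAD formulation, but the function name and `alpha` parameter are illustrative, not from this repository):

```python
import numpy as np

def vlad_pool(features, centroids, alpha=1.0):
    """Aggregate N D-dimensional vectors into a single K*D descriptor.

    features:  (N, D) input vectors (spatial activations or per-image descriptors)
    centroids: (K, D) cluster centres (a learned parameter in NetVLAD)
    """
    # Soft assignment of each vector to each cluster: softmax over
    # negative squared distances, scaled by alpha (stable via max-subtraction).
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -alpha * d2
    a = np.exp(logits - logits.max(1, keepdims=True))
    a /= a.sum(1, keepdims=True)                                        # (N, K)

    # Weighted residuals, summed over the N input vectors.
    residuals = features[:, None, :] - centroids[None, :, :]            # (N, K, D)
    vlad = (a[..., None] * residuals).sum(0)                            # (K, D)

    # Intra-normalize per cluster, then flatten and L2-normalize.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.ravel()                                                    # (K*D,)
    return v / (np.linalg.norm(v) + 1e-12)
```

Whether `features` holds the W*H spatial activations of one image (NetVLAD) or the N per-image descriptors of an image set (GhostVLAD), the output is the same fixed-length K*D vector.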

Is that clear now?

MrHwc commented

Thank you. I understand NetVLAD now, but I am still confused about GhostVLAD. For example, if B pictures are read, the generated feature map has size (B, W, H, D), and after GhostVLAD the output size is (B, D*K), not (1, D*K). Did I confuse B and N? If so, what is the shape of the feature map produced when the N pictures (one image set) are passed through the CNN? Also, besides adding the G ghost clusters, was GhostVLAD structurally changed so that the features of the N pictures are aggregated into one?

Relja commented

I think you are confusing what Net/GhostVLAD do with the implementation in this repository. Forget the differences under the hood between NetVLAD and GhostVLAD; the only thing that matters is that they all take NxD vectors in and produce a single K*D vector out. The only question is what the N vectors are.

In the original NetVLAD work, we consider every image in isolation and the goal is to produce a single image representation. We take the WxHxD feature map, pool over the spatial dimensions and obtain a K*D vector for the image.

In the GhostVLAD work, the goal is to have a single representation for a set of N images. We take NxD features (i.e. you pass each image in isolation through a network that computes a D-dimensional representation for it, e.g. a ResNet with average pooling over the spatial dimensions) and then pool over the images.

This code supports the first case - it pools over space and does so in a batched manner, so you get BxWxHxD -> Bx(K*D). But you can hack it into pooling over images by reshaping: take the NxD input, reshape it into 1x1xNxD, pass it through NetVLAD and you will again get a K*D vector. Or, if you have a batch of B image sets of N images each, reshape BxNxD into Bx1xNxD and again you'll get Bx(K*D) as desired.
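The reshaping trick above can be sketched in NumPy. This is a shape demonstration only, using a simplified stand-in for the NetVLAD layer (the `pool_set` function and all array names are hypothetical, not the repository's code); the point is that spatial pooling and image-set pooling are the same N-vector aggregation once you reshape:

```python
import numpy as np

def pool_set(vectors, centroids):
    """Simplified NetVLAD-style pooling: (N, D) vectors -> (K*D,) descriptor."""
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, K)
    a = np.exp(-d2 + d2.min(1, keepdims=True))                         # stable soft assignment
    a /= a.sum(1, keepdims=True)
    vlad = (a[..., None] * (vectors[:, None, :] - centroids)).sum(0)   # (K, D)
    return vlad.ravel()                                                # (K*D,)

B, W, H, N, D, K = 2, 5, 4, 10, 32, 6
centroids = np.random.randn(K, D)

# Case 1 (this repo): pool each image's W*H spatial activations -> Bx(K*D).
maps = np.random.randn(B, W, H, D)
per_image = np.stack([pool_set(m.reshape(-1, D), centroids) for m in maps])
assert per_image.shape == (B, K * D)

# Case 2 (GhostVLAD-style set pooling): treat N per-image descriptors
# as a 1x1xNxD "feature map" and pool over the N images instead.
image_set = np.random.randn(N, D)
as_map = image_set.reshape(1, 1, N, D)
set_descriptor = pool_set(as_map.reshape(-1, D), centroids)
assert set_descriptor.shape == (K * D,)
```

In both cases the layer only ever sees a flat list of D-dimensional vectors; the reshape just decides whether those vectors are spatial locations of one image or descriptors of N different images.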