Performance Comparison with NetVLAD
divyagupta25 opened this issue · 3 comments
Hi, thanks for your work and sharing the code!
I have been comparing the performance of CosPlace (vgg16_512.pth) and NetVLAD (pre-trained on Pittsburgh, 4096-D descriptors) on a custom aerial image dataset with 2500 query images and 2500 gallery images. Since the traverses are exactly aligned, I use the image index in place of UTM coordinates, with a tolerance of 25 frames. With NetVLAD I get R@1 and R@5 of 38.02 and 53.66, but with CosPlace I get only 33.2 and 40.0. Given the results in the paper, shouldn't CosPlace perform better? Am I missing something?
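For reference, the evaluation protocol described above (image index as ground truth, a fixed frame tolerance) could be sketched roughly as follows. This is only a minimal illustration and not the evaluation code of either repository: the function name `recall_at_k`, its parameters, and the use of cosine similarity are all assumptions, and it presumes that query `i` and gallery image `i` show the same place and that no descriptor is all zeros.

```python
import numpy as np

def recall_at_k(query_desc, gallery_desc, ks=(1, 5), tolerance=25):
    """Recall@K (in percent) when query i and gallery image i are the same place.

    A query counts as correctly localized at K if any of its K most similar
    gallery images has an index within `tolerance` frames of the query index.
    """
    # L2-normalize so that the dot product equals cosine similarity
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    g = gallery_desc / np.linalg.norm(gallery_desc, axis=1, keepdims=True)
    sims = q @ g.T  # shape: (num_queries, num_gallery)

    max_k = max(ks)
    # Gallery indices of the max_k most similar images, best match first
    top = np.argsort(-sims, axis=1)[:, :max_k]

    query_idx = np.arange(len(q))[:, None]
    hits = np.abs(top - query_idx) <= tolerance  # shape: (num_queries, max_k)
    return {k: 100.0 * hits[:, :k].any(axis=1).mean() for k in ks}
```

Calling `recall_at_k(query_desc, gallery_desc, ks=(1, 5), tolerance=25)` with two `(2500, D)` arrays would return a dict keyed by K. Running every model through the same function helps rule out differences in the evaluation itself.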
Hi, it is very difficult to find out where the problem is given so little information. Can you share the dataset, or at least one image?
Note that the domain gap between your dataset and the datasets used to train NetVLAD and CosPlace is probably very large, which could explain the low performance on your dataset. In our paper, we show that CosPlace largely outperforms NetVLAD on visual geolocalization / place recognition datasets, not on aerial datasets.
Also, using deeper nets should help to improve the results (we used a VGG-16 only for fairness with previous methods, but we advise using newer backbones). You can try a ResNet-18-512 for fast results, or a ResNet-152-2048 for best results.
I hope this helps
Hi,
Thanks for the prompt response. I agree about the domain gap, but I am using both NetVLAD and CosPlace in inference mode, and the results are better with the former. I would also like to add that I have tried PatchNetVLAD as well (in inference mode, pre-trained on the Pitts30k dataset, 4096-D), and it gives R@1 and R@5 of 67.09 and 78.25. So it should be safe to say that the NetVLAD methods are adapting better.
I tried the newer backbones, and surprisingly the results are poorer than VGG-16-512. Please find a summary of the results below:
Model | R@1 | R@5 |
---|---|---|
CosPlace (ResNet-18-512) | 29.8 | 32.9 |
CosPlace (ResNet-152-2048) | 26.6 | 31.7 |
CosPlace (VGG-16-512) | 33.2 | 40.0 |
NetVLAD 4096-D | 38.02 | 53.66 |
PatchNetVLAD (4096-D, Patch sizes: 2,5,8) | 67.09 | 78.25 |
Thanks for sharing the results, they are quite interesting.
The images from your dataset are indeed very different from the ones used in geolocalization. Perhaps the more a network learns to solve geolocalization, the more it unlearns how to solve your dataset. This would explain the negative correlation between results on datasets like Pitts30k and on your aerial dataset. Also note that Patch-NetVLAD belongs to a different category of methods, as it applies spatial verification for re-ranking. Given that your images have little visual overlap (less than 50%), methods that use spatial verification for re-ranking are expected to perform better.
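To illustrate what re-ranking means here, a generic sketch is given below. It is not Patch-NetVLAD's actual implementation: the function `rerank_by_verification` and the `verify_score` callable are hypothetical stand-ins for whatever spatial verification step is used (for example, an inlier count from geometric matching of local features).

```python
def rerank_by_verification(candidates, verify_score, top_n=100):
    """Re-order the first `top_n` retrieved candidates by a verification score.

    candidates   : gallery indices sorted by global-descriptor similarity
    verify_score : callable(gallery_index) -> number, higher is better
                   (hypothetical; e.g. a geometric inlier count)
    """
    head, tail = list(candidates[:top_n]), list(candidates[top_n:])
    # sorted() is stable, so candidates with equal scores keep their
    # original retrieval order
    head = sorted(head, key=verify_score, reverse=True)
    return head + tail
```

Since only the shortlist is re-ordered, R@K can change only for K smaller than `top_n`; a correct match that the global descriptor ranked low within the shortlist can still be promoted to the top.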