Official code for CVPR 2022 (Oral) paper "Deep Visual Geo-localization Benchmark"

Deep Visual Geo-localization Benchmark

This is the official repository for the CVPR 2022 Oral paper Deep Visual Geo-localization Benchmark. It can be used to reproduce the results from the paper and to run a wide range of experiments by changing the components of a Visual Geo-localization pipeline.

Setup

Before you begin experimenting with this toolbox, organize your dataset in a directory tree as follows:

.
├── benchmarking_vg
└── datasets_vg
    └── datasets
        └── pitts30k
            └── images
                ├── train
                │   ├── database
                │   └── queries
                ├── val
                │   ├── database
                │   └── queries
                └── test
                    ├── database
                    └── queries

The datasets_vg repo can be used to download a number of datasets. Detailed instructions on how to download datasets are in the repo. Note that many datasets are available, and pitts30k is just an example.
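Before training, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch and not part of the toolbox: the helper name `missing_dataset_dirs` is hypothetical, and the example path assumes the script is run from inside `benchmarking_vg`, next to `datasets_vg`.

```python
from pathlib import Path

# Splits and subsets taken from the directory tree shown above.
SPLITS = ("train", "val", "test")
SUBSETS = ("database", "queries")


def missing_dataset_dirs(datasets_folder, dataset_name):
    """Return the expected sub-folders that do not exist, as sorted path strings."""
    images = Path(datasets_folder) / dataset_name / "images"
    expected = [images / split / subset for split in SPLITS for subset in SUBSETS]
    return sorted(str(path) for path in expected if not path.is_dir())


if __name__ == "__main__":
    # Assumed location: datasets_vg is a sibling of the current folder.
    missing = missing_dataset_dirs("../datasets_vg/datasets", "pitts30k")
    if missing:
        print("Missing folders:")
        for path in missing:
            print("  " + path)
    else:
        print("Dataset layout looks complete.")
```

An empty list means all six database and queries folders are present.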

Running experiments

Basic experiment

For a basic experiment, run:

$ python3 train.py --dataset_name=pitts30k

This trains a ResNet-18 + NetVLAD on Pitts30k. The experiment creates a folder named ./logs/default/YYYY-MM-DD_HH-mm-ss, where checkpoints are saved, along with an info.log file containing training logs and other information, such as model size, FLOPs and descriptor dimensionality.
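Because each run gets its own timestamped folder, a small helper can locate the most recent one, for instance to pass its checkpoint to a later evaluation. This is a sketch under the assumption that folder names follow the YYYY-MM-DD_HH-mm-ss pattern exactly; the function name `latest_run` is hypothetical and not part of the toolbox.

```python
from datetime import datetime
from pathlib import Path

# strptime equivalent of the YYYY-MM-DD_HH-mm-ss folder name pattern.
RUN_FORMAT = "%Y-%m-%d_%H-%M-%S"


def latest_run(logs_folder="logs/default"):
    """Return the Path of the most recent run folder, or None if none is found."""
    root = Path(logs_folder)
    if not root.is_dir():
        return None
    runs = []
    for child in root.iterdir():
        if not child.is_dir():
            continue
        try:
            started = datetime.strptime(child.name, RUN_FORMAT)
        except ValueError:
            continue  # skip folders that do not follow the timestamp pattern
        runs.append((started, child))
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]


if __name__ == "__main__":
    run = latest_run()
    print(run if run is not None else "No run folder found")
```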

Architectures and mining

You can replace the backbone and the aggregation as follows:

$ python3 train.py --dataset_name=pitts30k --backbone=resnet50conv4 --aggregation=gem

You can easily use ResNets cropped at conv4 or conv5.

Add a fully connected layer

To add a fully connected layer of dimension 2048 to GeM pooling:

$ python3 train.py --dataset_name=pitts30k --backbone=resnet50conv4 --aggregation=gem --fc_output_dim=2048

Add PCA

To add PCA to a NetVLAD layer, run:

$ python3 eval.py --dataset_name=pitts30k --backbone=resnet50conv4 --aggregation=netvlad --pca_dim=2048 --pca_dataset_folder=pitts30k/images/train

Here pca_dataset_folder points to the folder with the images used to compute PCA. In the paper we compute PCA's principal components on the train set, as this gave the best results. PCA is used only at test time.

Evaluate trained models

To evaluate the trained model on other datasets (this example is with the St Lucia dataset), simply run

$ python3 eval.py --backbone=resnet50conv4 --aggregation=gem --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --dataset_name=st_lucia

Reproduce the results

Finally, to reproduce our results, use the appropriate mining method (full for pitts30k, partial for msls), for example:

$ python3 train.py --dataset_name=pitts30k --mining=full

As simple as this, you can replicate all results from tables 3, 4, 5 of the main paper, as well as tables 2, 3, 4 of the supplementary.
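When launching several training runs, the dataset-to-mining pairing can be kept in one place. The sketch below only assembles the command lines, using the flags shown above; the names `MINING_BY_DATASET` and `train_command` are hypothetical and not part of the toolbox.

```python
import shlex

# Mining method per dataset, as stated above.
MINING_BY_DATASET = {"pitts30k": "full", "msls": "partial"}


def train_command(dataset_name):
    """Build the train.py argument list for a dataset, with its mining method."""
    mining = MINING_BY_DATASET[dataset_name]
    return [
        "python3",
        "train.py",
        f"--dataset_name={dataset_name}",
        f"--mining={mining}",
    ]


if __name__ == "__main__":
    for name in MINING_BY_DATASET:
        # Print the command; pass the list to subprocess.run to execute it.
        print(shlex.join(train_command(name)))
```

Building the command as a list rather than a single string avoids quoting problems if it is later handed to `subprocess.run`.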

Resize

To resize the images, pass the resize parameter with the target resolution. For example, to use 80% of the full resolution of the pitts30k images, pass 384 512, because the full images are 480, 640:

$ python3 train.py --dataset_name=pitts30k --resize=384 512
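The arithmetic behind the example can be sketched as follows. The helper name `scaled_resolution` is hypothetical, and rounding to the nearest integer is an assumption for scales that do not divide evenly.

```python
def scaled_resolution(first, second, scale):
    """Scale both image dimensions by `scale`, rounding to the nearest integer."""
    return round(first * scale), round(second * scale)


if __name__ == "__main__":
    # 80% of the full 480, 640 pitts30k resolution.
    a, b = scaled_resolution(480, 640, 0.8)
    print(f"--resize={a} {b}")  # prints: --resize=384 512
```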

Query pre/post-processing and predictions refinement

We gather all such methods under the test_method parameter. The available methods are hard_resize, single_query, central_crop, five_crops_mean, nearest_crop and majority_voting. Although hard_resize is the default, in most datasets it doesn't apply any transformation at all (see the paper for more information), because all images have the same resolution.

$ python3 eval.py --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --dataset_name=tokyo247 --test_method=nearest_crop

Data augmentation

You can reproduce all data augmentation techniques from the paper with simple commands, for example:

$ python3 train.py --dataset_name=pitts30k --horizontal_flipping --saturation 2 --brightness 1

Off-the-shelf models trained on Landmark Recognition datasets

The code can automatically download and use models trained on Landmark Recognition datasets from popular repositories: radenovic and naver. These repos offer ResNet-50/101 models with GeM and FC 2048 trained on such datasets, which can be used as follows:

$ python eval.py --off_the_shelf=radenovic_gldv1 --l2=after_pool --backbone=r101l4 --aggregation=gem --fc_output_dim=2048

$ python eval.py --dataset_name=pitts30k --off_the_shelf=naver --l2=none --backbone=r101l4 --aggregation=gem --fc_output_dim=2048

Using pretrained networks on other datasets

Check out our pretrain_vg repo, which we use to train such models. You can automatically download those models and train starting from them as follows:

$ python train.py --dataset_name=pitts30k --pretrained=places

Changing the threshold distance

You can use a threshold distance other than the default 25 meters (for example, 100 meters) as follows:

$ python3 eval.py --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --val_positive_dist_threshold=100
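To illustrate what the threshold controls, the sketch below marks a database image as a positive for a query when the two are within the threshold distance. It assumes positions are planar coordinates in meters (for example UTM) and that the comparison is inclusive; both are assumptions for illustration, and `positive_indices` is a hypothetical name, not the toolbox's implementation.

```python
import math


def positive_indices(query_xy, database_xy, threshold_m=25.0):
    """Return indices of database positions within `threshold_m` meters of the query."""
    return [
        index
        for index, xy in enumerate(database_xy)
        if math.dist(query_xy, xy) <= threshold_m
    ]


if __name__ == "__main__":
    database = [(3.0, 4.0), (30.0, 40.0), (0.0, 90.0)]  # 5 m, 50 m and 90 m away
    query = (0.0, 0.0)
    print(positive_indices(query, database))         # default 25 m threshold
    print(positive_indices(query, database, 100.0))  # relaxed 100 m threshold
```

Raising the threshold enlarges the set of database images that count as correct, so recalls computed with 100 meters are not comparable to those computed with 25 meters.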

Changing the recall values (R@N)

By default the toolbox computes recall@N for N = 1, 5, 10, 20, but you can compute other recall values as follows:

$ python3 eval.py --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --recall_values 1 5 10 15 20 50 100
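As a reminder of what R@N measures, the sketch below counts a query as correct at N if at least one of its top-N retrieved database images is a positive. It is an illustrative version under that standard definition, not the toolbox's code; the function name and input format are assumptions.

```python
def recall_at_n(predictions, positives, recall_values=(1, 5, 10, 20)):
    """Compute recall@N as a percentage for each N in `recall_values`.

    predictions: one ranked list of database indices per query (best match first).
    positives: one set of correct database indices per query.
    """
    if not predictions:
        raise ValueError("at least one query is required")
    hits = {n: 0 for n in recall_values}
    for ranked, correct in zip(predictions, positives):
        for n in recall_values:
            # A query is a hit at N if any of its top-N predictions is a positive.
            if any(index in correct for index in ranked[:n]):
                hits[n] += 1
    return {n: 100.0 * hits[n] / len(predictions) for n in recall_values}


if __name__ == "__main__":
    predictions = [[3, 1, 2], [0, 4, 5], [7, 8, 9], [2, 0, 1]]
    positives = [{3}, {5}, {6}, {0}]
    print(recall_at_n(predictions, positives, recall_values=(1, 2, 3)))
    # prints: {1: 25.0, 2: 50.0, 3: 75.0}
```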

Model Zoo

We are currently exploring hosting options, so this is a partial list of models. More models will be added soon!!

Pretrained models with different backbones

Pretrained networks employing different backbones.

| Model | Trained on Pitts30k: Pitts30k (R@1) | Trained on Pitts30k: MSLS (R@1) | Trained on Pitts30k: Download | Trained on MSLS: Pitts30k (R@1) | Trained on MSLS: MSLS (R@1) | Trained on MSLS: Download |
|---|---|---|---|---|---|---|
| vgg16-gem | 78.5 | 43.4 | [Link] | 70.2 | 66.7 | [Link] |
| resnet18-gem | 77.8 | 35.3 | [Link] | 71.6 | 65.3 | [Link] |
| resnet50-gem | 82.0 | 38.0 | [Link] | 77.4 | 72.0 | [Link] |
| resnet101-gem | 82.4 | 39.6 | [Link] | 77.2 | 72.5 | [Link] |
| ViT(224)-CLS | - | - | - | 80.4 | 69.3 | [Link] |
| vgg16-netvlad | 83.2 | 50.9 | [Link] | 79.0 | 74.6 | [Link] |
| resnet18-netvlad | 86.4 | 47.4 | [Link] | 81.6 | 75.8 | [Link] |
| resnet50-netvlad | 86.0 | 50.7 | [Link] | 80.9 | 76.9 | [Link] |
| resnet101-netvlad | 86.5 | 51.8 | [Link] | 80.8 | 77.7 | [Link] |
| cct384-netvlad | 85.0 | 52.5 | [Link] | 80.3 | 85.1 | [Link] |

Pretrained models with different aggregation methods

Pretrained networks trained using different aggregation methods.

| Model | Trained on Pitts30k: Pitts30k (R@1) | Trained on Pitts30k: MSLS (R@1) | Trained on Pitts30k: Download | Trained on MSLS: Pitts30k (R@1) | Trained on MSLS: MSLS (R@1) | Trained on MSLS: Download |
|---|---|---|---|---|---|---|
| resnet50-gem | 82.0 | 38.0 | [Link] | 77.4 | 72.0 | [Link] |
| resnet50-gem-fc2048 | 80.1 | 33.7 | [Link] | 79.2 | 73.5 | [Link] |
| resnet50-gem-fc65536 | 80.8 | 35.8 | [Link] | 79.0 | 74.4 | [Link] |
| resnet50-netvlad | 86.0 | 50.7 | [Link] | 80.9 | 76.9 | [Link] |
| resnet50-crn | 85.8 | 54.0 | [Link] | 80.8 | 77.8 | [Link] |

Pretrained models with different mining methods

Pretrained networks trained using three different mining methods (random, full database mining and partial database mining):

| Model | Trained on Pitts30k: Pitts30k (R@1) | Trained on Pitts30k: MSLS (R@1) | Trained on Pitts30k: Download | Trained on MSLS: Pitts30k (R@1) | Trained on MSLS: MSLS (R@1) | Trained on MSLS: Download |
|---|---|---|---|---|---|---|
| resnet18-gem-random | 73.7 | 30.5 | [Link] | 62.2 | 50.6 | [Link] |
| resnet18-gem-full | 77.8 | 35.3 | [Link] | 70.1 | 61.8 | [Link] |
| resnet18-gem-partial | 76.5 | 34.2 | [Link] | 71.6 | 65.3 | [Link] |
| resnet18-netvlad-random | 83.9 | 43.6 | [Link] | 73.3 | 61.5 | [Link] |
| resnet18-netvlad-full | 86.4 | 47.4 | [Link] | - | - | - |
| resnet18-netvlad-partial | 86.2 | 47.3 | [Link] | 81.6 | 75.8 | [Link] |
| resnet50-gem-random | 77.9 | 34.3 | [Link] | 69.5 | 57.4 | [Link] |
| resnet50-gem-full | 82.0 | 38.0 | [Link] | 77.3 | 69.7 | [Link] |
| resnet50-gem-partial | 82.3 | 39.0 | [Link] | 77.4 | 72.0 | [Link] |
| resnet50-netvlad-random | 83.4 | 45.0 | [Link] | 74.9 | 63.6 | [Link] |
| resnet50-netvlad-full | 86.0 | 50.7 | [Link] | - | - | - |
| resnet50-netvlad-partial | 85.5 | 48.6 | [Link] | 80.9 | 76.9 | [Link] |

If you find our work useful in your research, please consider citing our paper:

@inProceedings{Berton_CVPR_2022_benchmark,
    author    = {Berton, Gabriele and Mereu, Riccardo and Trivigno, Gabriele and Masone, Carlo and
                 Csurka, Gabriela and Sattler, Torsten and Caputo, Barbara},
    title     = {Deep Visual Geo-localization Benchmark},
    booktitle = {CVPR},
    month     = {June},
    year      = {2022},
}

Acknowledgements

Parts of this repo are inspired by the following great repositories:

Also check out our other repo, CosPlace, from the CVPR 2022 paper "Rethinking Visual Geo-localization for Large-Scale Applications", which provides a new SOTA in visual geo-localization / visual place recognition.