SelaVPR

This is the official repository for the ICLR 2024 paper "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition".

Summary

This paper presents SelaVPR, a novel method for Seamless adaptation of pre-trained foundation models to the (two-stage) VPR task. By adding a few tunable lightweight adapters to the frozen pre-trained model, we achieve an efficient hybrid global-local adaptation that yields both global features for retrieving candidate places and dense local features for re-ranking. The SelaVPR feature representation focuses on discriminative landmarks, closing the gap between the pre-training and VPR tasks and fully unleashing the capability of pre-trained models for VPR. SelaVPR can directly match the local features without spatial verification, which makes re-ranking much faster.

The global adaptation is achieved by adding adapters after the multi-head attention layer and in parallel to the MLP layer in each transformer block (see adapter1 and adapter2 in /backbone/dinov2/block.py).
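
The adapter placement described above can be sketched as a toy, framework-free residual block. This only illustrates the wiring and is not the code in /backbone/dinov2/block.py: the function `toy_block`, the stand-in `attn` and `mlp` callables, the omission of layer normalization, and the scale factor `s` are all assumptions made for the example.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def scale(a: Vector, s: float) -> Vector:
    """Multiply every element of a vector by s."""
    return [s * x for x in a]


def toy_block(x: Vector, attn: Layer, mlp: Layer,
              adapter1: Layer, adapter2: Layer, s: float = 1.0) -> Vector:
    """Toy transformer block with one serial and one parallel adapter.

    Layer normalization is left out for brevity (assumption).
    """
    # Serial adapter: applied to the attention output before the residual add.
    h = add(x, adapter1(attn(x)))
    # Parallel adapter: sees the same input as the MLP; both outputs are summed.
    return add(h, add(mlp(h), scale(adapter2(h), s)))


if __name__ == "__main__":
    double = lambda v: [2.0 * t for t in v]   # stand-in for attention / MLP
    identity = lambda v: list(v)              # pass-through adapter1
    zero = lambda v: [0.0] * len(v)           # adapter2 that contributes nothing
    # With these adapters the block reduces to a plain residual block.
    print(toy_block([1.0, 2.0], double, double, identity, zero))  # [9.0, 18.0]
```

Only the two adapters would be trainable in such a setup; the attention and MLP sub-layers stay frozen, as stated in the summary.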

The local adaptation is implemented by adding up-convolutional layers after the entire ViT backbone to upsample the feature map and get dense local features (see LocalAdapt in network.py).
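
To see how up-convolution densifies the patch grid, the standard output-size formula for a transposed convolution (dilation 1) can be evaluated directly. The numbers used here are illustrative assumptions, not values taken from this README: a 224x224 input (16x16 patch tokens for ViT-L/14) and two layers with kernel 3, stride 2, padding 1. Check LocalAdapt in network.py for the actual configuration.

```python
def upconv_out_size(size: int, kernel: int, stride: int,
                    padding: int = 0, output_padding: int = 0) -> int:
    """Output side length of a transposed convolution with dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


if __name__ == "__main__":
    side = 224 // 14  # 16 patch tokens per side (assumed 224x224 input)
    for _ in range(2):  # two up-convolutional layers (assumed)
        side = upconv_out_size(side, kernel=3, stride=2, padding=1)
    print(side)  # 61
```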

Getting Started

This repo follows the Visual Geo-localization Benchmark. To prepare the datasets, refer to its VPR-datasets-downloader.

The datasets should be organized in a directory tree as follows:

├── datasets_vg
    └── datasets
        └── pitts30k
            └── images
                ├── train
                │   ├── database
                │   └── queries
                ├── val
                │   ├── database
                │   └── queries
                └── test
                    ├── database
                    └── queries
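
A quick way to confirm that a dataset matches the tree above before launching a run is a small check like the following. It is a convenience sketch, not part of this repo; the helper name `missing_dirs` is hypothetical.

```python
from pathlib import Path
from typing import List


def missing_dirs(datasets_folder: str, dataset_name: str) -> List[str]:
    """Return the expected sub-directories that do not exist."""
    root = Path(datasets_folder) / dataset_name / "images"
    expected = [root / split / part
                for split in ("train", "val", "test")
                for part in ("database", "queries")]
    return [str(p) for p in expected if not p.is_dir()]


if __name__ == "__main__":
    problems = missing_dirs("/path/to/your/datasets_vg/datasets", "pitts30k")
    print("layout OK" if not problems else "missing:\n" + "\n".join(problems))
```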

Before training, download the pre-trained foundation model DINOv2 (ViT-L/14) HERE.

Train

Finetuning on MSLS

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --queries_per_epoch=30000 --foundation_model_path=/path/to/pre-trained/dinov2_vitl14_pretrain.pth

Further finetuning on Pitts30k

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --queries_per_epoch=5000 --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth

Trained Models

The model finetuned on MSLS (for diverse scenes).

Download: LINK

Dataset	R@1	R@5	R@10
MSLS-val	90.8	96.4	97.2
Nordland-test	85.2	95.5	98.5
St. Lucia	99.8	100.0	100.0

The model further finetuned on Pitts30k (only for urban scenes).

Download: LINK

Dataset	R@1	R@5	R@10
Tokyo24/7	94.0	96.8	97.5
Pitts30k	92.8	96.8	97.7
Pitts250k	95.7	98.8	99.2

Test

Set rerank_num=100 to reproduce the results in the paper, or set rerank_num=20 to achieve close results with only 1/5 of the re-ranking runtime (0.018 s per query).

python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --resume=/path/to/finetuned/pitts30k/model/SelaVPR_pitts30k.pth --rerank_num=100
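
The effect of rerank_num can be illustrated with a toy two-stage retrieval: global features rank the whole database, and only the top rerank_num candidates are re-ordered by local matching. This is a sketch under assumptions, not the code in this repo: the mutual-nearest-neighbour match count is used here as one simple way to match local features without spatial verification, and all function names are hypothetical.

```python
from typing import List, Sequence

Vec = Sequence[float]


def sq_dist(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v: Vec, pool: Sequence[Vec]) -> int:
    """Index of the vector in `pool` closest to `v`."""
    return min(range(len(pool)), key=lambda i: sq_dist(v, pool[i]))


def count_mutual_matches(feats_a: Sequence[Vec], feats_b: Sequence[Vec]) -> int:
    """Number of local-feature pairs that are each other's nearest neighbour."""
    return sum(1 for i, f in enumerate(feats_a)
               if nearest(feats_b[nearest(f, feats_b)], feats_a) == i)


def retrieve(q_global: Vec, q_local: Sequence[Vec],
             db_global: Sequence[Vec], db_local: Sequence[Sequence[Vec]],
             rerank_num: int) -> List[int]:
    # Stage 1: rank the whole database by global-feature distance.
    order = sorted(range(len(db_global)),
                   key=lambda i: sq_dist(q_global, db_global[i]))
    head, tail = order[:rerank_num], order[rerank_num:]
    # Stage 2: re-order only the top candidates by local match count.
    head.sort(key=lambda i: count_mutual_matches(q_local, db_local[i]),
              reverse=True)
    return head + tail


if __name__ == "__main__":
    db_global = [[0.1], [0.2], [5.0]]
    db_local = [[[10, 10]], [[0, 0], [1, 1]], [[0, 0], [1, 1]]]
    print(retrieve([0.0], [[0, 0], [1, 1]], db_global, db_local, rerank_num=2))
    # [1, 0, 2]
```

Because stage 2 runs once per candidate, its cost grows linearly with rerank_num, which is consistent with rerank_num=20 taking about 1/5 of the re-ranking time of rerank_num=100.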

Local Matching using DINOv2+Registers

By adding registers, DINOv2 can achieve better local matching performance. A pre-trained DINOv2+registers model can be downloaded HERE.

To use a SelaVPR model built on the DINOv2+registers backbone, add --registers to the train or test command and load the corresponding model with registers, for example:

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --queries_per_epoch=30000 --foundation_model_path=/path/to/pre-trained/dinov2_vitl14_reg4_pretrain.pth --registers

python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=msls --resume=/path/to/finetuned/msls/model/SelaVPR_reg4_msls.pth --rerank_num=100 --registers

The finetuned (on MSLS) SelaVPR model with registers can be downloaded HERE.

To run (dense or coarse) local matching between two images, use

python3 visualize_pairs.py --datasets_folder=./ --resume=/path/to/finetuned/msls/model/SelaVPR_reg4_msls.pth --registers

Efficient RAM Usage (optional)

The test_efficient_ram_usage() function in test.py addresses the problem of running out of RAM, which may cause the program to be killed. It saves the extracted local features in ./output_local_features/ and loads into RAM only the local features that are currently needed. To use it, add --efficient_ram_testing to the train or test command, for example:

python3 train.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --queries_per_epoch=5000 --resume=/path/to/finetuned/msls/model/SelaVPR_msls.pth --efficient_ram_testing

python3 eval.py --datasets_folder=/path/to/your/datasets_vg/datasets --dataset_name=pitts30k --resume=/path/to/finetuned/pitts30k/model/SelaVPR_pitts30k.pth --rerank_num=100 --efficient_ram_testing
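
The idea behind this option can be sketched as a disk-backed feature store: each image's local features go to their own file and are read back one at a time. This is not the implementation of test_efficient_ram_usage(); the helper names, the one-file-per-image layout, and the use of pickle are assumptions chosen to keep the example dependency-free.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Iterable, Iterator, Tuple


def save_features(out_dir: Path, index: int, feats: Any) -> None:
    """Write one image's local features to its own file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{index}.pkl", "wb") as fh:
        pickle.dump(feats, fh)


def load_features(out_dir: Path, index: int) -> Any:
    """Read back the local features of a single image."""
    with open(out_dir / f"{index}.pkl", "rb") as fh:
        return pickle.load(fh)


def iter_candidates(out_dir: Path,
                    candidate_ids: Iterable[int]) -> Iterator[Tuple[int, Any]]:
    """Yield (index, features) lazily so only one item is loaded at a time."""
    for idx in candidate_ids:
        yield idx, load_features(out_dir, idx)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "output_local_features"
        for i in range(3):
            save_features(store, i, [[float(i)]])  # dummy features
        for idx, feats in iter_candidates(store, [2, 0]):
            print(idx, feats)
```

The trade-off is extra disk I/O during re-ranking in exchange for a much smaller peak memory footprint.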

More Details about Datasets

MSLS-val: We use the official version of MSLS-val (which contains only 740 query images) for testing. It is a subset of the MSLS-val formatted by VPR-datasets-downloader (which contains about 11k query images). More details can be found here.

Nordland-test: Download the downsampled version here.

Related Work

Our other work, CricaVPR (one-stage VPR based on DINOv2), presents a multi-scale convolution-enhanced adaptation method and achieves SOTA performance on several datasets. The code is released HERE.

Acknowledgements

Parts of this repo are inspired by the following repositories:

Visual Geo-localization Benchmark

DINOv2

Citation

If you find this repo useful for your research, please consider leaving a star⭐️ and citing the paper

@inproceedings{selavpr,
  title={Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition},
  author={Lu, Feng and Zhang, Lijun and Lan, Xiangyuan and Dong, Shuting and Wang, Yaowei and Yuan, Chun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
