Official implementation and datasets for AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization, accepted at ECCV 2024.



Abstract. In this study, we introduce a new problem raised by social media and photojournalism, named Image Address Localization (IAL), which aims to predict the readable textual address where an image was taken. Existing two-stage approaches involve predicting geographical coordinates and converting them into human-readable addresses, which can lead to ambiguity and be resource-intensive. In contrast, we propose an end-to-end framework named AddressCLIP to solve the problem with more semantics, consisting of two key ingredients: i) image-text alignment to align images with addresses and scene captions by contrastive learning, and ii) image-geography matching to constrain image features with the spatial distance in terms of manifold learning. Additionally, we have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem. Experiments demonstrate that our approach achieves compelling performance on the proposed datasets and outperforms representative transfer learning methods for vision-language models. Furthermore, extensive ablations and visualizations exhibit the effectiveness of the proposed method.
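
The image-text alignment ingredient mentioned in the abstract can be illustrated with a CLIP-style symmetric contrastive loss. The following is a minimal, dependency-free sketch and not the training code of this repository: the function names, the temperature of 0.07, and the toy feature vectors are assumptions made for illustration. It covers only image-address alignment and leaves out the scene captions and the image-geography matching term.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row k of image_feats is assumed to match row k of text_feats;
    every other row in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # Cosine similarities scaled by the temperature (assumed value).
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy 2-D features: matched pairs point in the same direction.
    images = [[1.0, 0.0], [0.0, 1.0]]
    addresses = [[2.0, 0.0], [0.0, 3.0]]
    print(symmetric_contrastive_loss(images, addresses))
```

The loss is close to zero when each image feature is most similar to its own address feature, and it grows when the pairing is shuffled.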


python == 3.8
clip == 1.0
torch == 2.1.1
torchvision == 0.16.1
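
To confirm that an environment matches the pins above, a small check along the following lines may help. It is a sketch only: `check_versions` is a hypothetical helper, and the distribution name `clip` is an assumption about how the CLIP package was installed.

```python
import sys
from importlib import metadata

# Pins copied from the requirement list above.
# "clip" as a distribution name is an assumption about the install.
EXPECTED = {"clip": "1.0", "torch": "2.1.1", "torchvision": "0.16.1"}


def check_versions(expected):
    """Return {package: status} comparing installed versions with the pins."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build suffixes such as "+cu118".
        base = installed.split("+")[0]
        report[name] = "ok" if base == wanted else f"mismatch (found {installed})"
    return report


if __name__ == "__main__":
    py_ok = sys.version_info[:2] == (3, 8)
    print("python:", "ok" if py_ok else f"mismatch (found {sys.version.split()[0]})")
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```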

Image Address Localization Datasets


Download the annotations and splits of IAL-datasets used in the paper from Baidu Cloud link or Google Drive link.

Feel free to use the addresses and captions in them!


Download Pittsburgh-250k

Download the original Pittsburgh-250k dataset from here.


Extract the .zip files and put them all into the ./datasets/Pitts-IAL/ folder.

SF-IAL-Base & SF-IAL-Large

Download CosPlace

Follow the instructions from CosPlace to obtain the original SF-XL dataset.

We only used the images in the /processed folder. Download it and put it into the ./datasets/processed/ folder.
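
Once both downloads are in place, the folder layout can be sanity-checked with a short script. This is a sketch under the assumption that the two folders named in this README are the only ones required; `missing_dataset_dirs` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path

# Folder names taken from the download instructions in this README.
EXPECTED_DIRS = ("Pitts-IAL", "processed")


def missing_dataset_dirs(root="./datasets"):
    """Return the expected dataset folders that are absent or empty."""
    root = Path(root)
    missing = []
    for name in EXPECTED_DIRS:
        folder = root / name
        # A folder counts as missing if it does not exist or has no entries.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing or empty dataset folders:", ", ".join(missing))
    else:
        print("All expected dataset folders are present.")
```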


If this project is helpful for you, please cite our paper:

title={AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization},
author={Xu, Shixiong and Zhang, Chenghao and Fan, Lubin and Meng, Gaofeng and Xiang, Shiming and Ye, Jieping},
booktitle={European Conference on Computer Vision (ECCV)},


This repository makes liberal use of code from CLIP, open_clip and LAVIS.