
Official implementation of the AAAI 2022 paper "Learning Token-based Representation for Image Retrieval"


Token: Token-Based Representation for Image Retrieval

🆕✅🎉 Updated 21st April 2022: We release weights for different models.

🆕✅🎉 Updated 15th December 2021: We extend the proposed aggregation method to a variety of existing local features.

mAP performance of the proposed model

We provide results of Token. mAP is computed with the Medium (M) and Hard (H) evaluation protocols; +R1M denotes adding the R1M distractor set.

| Model | ROxf (M) | ROxf+R1M (M) | RPar (M) | RPar+R1M (M) | ROxf (H) | ROxf+R1M (H) | RPar (H) | RPar+R1M (H) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R101-HOW + VLAD | 73.54 | 60.38 | 82.33 | 62.56 | 51.93 | 33.17 | 66.95 | 41.82 |
| R101-HOW + ASMK | 80.42 | 70.17 | 85.43 | 68.80 | 62.51 | 45.36 | 70.76 | 45.39 |
| R101-NetVLAD | 73.91 | 60.51 | 86.81 | 71.31 | 56.45 | 37.92 | 73.61 | 48.98 |
| R101-RMAC | 75.14 | 61.88 | 85.28 | 67.37 | 53.77 | 36.45 | 71.28 | 44.01 |
| R101-SOLAR | 79.65 | 67.61 | 88.63 | 73.21 | 59.99 | 41.14 | 76.15 | 50.98 |
| R101-DELG | 78.55 | 66.02 | 88.58 | 73.65 | 60.89 | 41.75 | 76.05 | 51.46 |
| R50-Token | 79.79 | 67.36 | 88.08 | 74.33 | 62.68 | 45.70 | 75.49 | 52.68 |
| R101-Token | 82.16 | 70.58 | 89.40 | 77.24 | 65.75 | 47.46 | 78.44 | 56.81 |

Local feature aggregation framework

The framework diagram is shown below: given an image, we extract its local features and aggregate them with our proposed method.

[Figure: Aggregator, the local feature aggregation framework]

The table below compares ASMK aggregation with our proposed aggregation method. The ASMK aggregator requires offline clustering to generate large codebooks, whereas our aggregator requires supervised training, for which we use the GLDv2 dataset. A sketch of this plug-in aggregation follows the table.

| Model | ROxf (M) | RPar (M) | ROxf (H) | RPar (H) |
| --- | --- | --- | --- | --- |
| R50-HOW + ASMK | 79.4 | 81.6 | 56.9 | 62.4 |
| R50-HOW + Ours | 80.7 | 86.5 | 60.9 | 72.0 |
| R101-HOW + ASMK | 80.4 | 85.4 | 62.5 | 70.8 |
| R101-HOW + Ours | 83.2 | 87.7 | 64.8 | 75.3 |
| R50-DELF + ASMK | 67.8 | 76.9 | 43.1 | 55.4 |
| R50-DELF + Ours | 75.2 | 86.0 | 55.0 | 72.2 |
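
For existing local features, the same idea can be applied to a precomputed set of descriptors. Below is a minimal sketch, assuming the descriptors arrive as a tensor of shape [N, B, D]; the class name, learned-query design, and dimensions are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class LocalFeatureAggregator(nn.Module):
    """Sketch: aggregate a set of local descriptors (e.g. from HOW or DELF)
    into L visual tokens via learned queries and cross-attention, then
    concatenate and reduce to a single global descriptor."""
    def __init__(self, dim=128, num_tokens=8, out_dim=1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))  # learned token centers
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4)
        self.reduce = nn.Linear(num_tokens * dim, out_dim)

    def forward(self, feats):                    # feats: [N, B, D] local descriptors
        q = self.queries.unsqueeze(1).expand(-1, feats.size(1), -1)  # [L, B, D]
        tokens, _ = self.cross_attn(q, feats, feats)                 # tokens attend to features
        fg = self.reduce(tokens.permute(1, 0, 2).flatten(1))         # concat + reduce
        return nn.functional.normalize(fg, dim=-1)

# Usage: 500 128-dim descriptors (HOW's descriptor size) for a batch of 2 images
agg = LocalFeatureAggregator()
global_desc = agg(torch.randn(500, 2, 128))      # [2, 1024]
```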

PyTorch training code for Token-based Representation for Image Retrieval. We propose a joint local feature learning and aggregation framework, obtaining 82.3 mAP on ROxf with the Medium evaluation protocol. Inference in 50 lines of PyTorch.

Token

What it is. Given an image, Token first uses a CNN and a Local Feature Self-Attention (LFSA) module to extract local features $F_c$. These are then tokenized into $L$ visual tokens with spatial attention. A refinement block further enhances the obtained visual tokens with self-attention and cross-attention. Finally, Token concatenates all the visual tokens to form a compact global representation $f_g$ and reduces its dimension. The aggregated global feature is discriminative and efficient.
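
Below is a minimal sketch of this pipeline in PyTorch. The module and parameter names (`TokenSketch`, the LFSA stand-in, token count, dimensions) are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class TokenSketch(nn.Module):
    """Sketch of the Token pipeline: CNN backbone -> LFSA -> spatial-attention
    tokenizer -> refinement (self- and cross-attention) -> concat + reduction."""
    def __init__(self, backbone, dim=1024, num_tokens=4, out_dim=1024):
        super().__init__()
        self.backbone = backbone                          # outputs [B, dim, H, W]
        self.lfsa = nn.MultiheadAttention(dim, 8)         # stand-in for the LFSA module
        self.token_attn = nn.Conv2d(dim, num_tokens, 1)   # one spatial attention map per token
        self.refine = nn.TransformerDecoderLayer(dim, 8)  # self- and cross-attention refinement
        self.reduce = nn.Linear(num_tokens * dim, out_dim)

    def forward(self, x):
        fc = self.backbone(x)                             # local features F_c: [B, D, H, W]
        B, D, H, W = fc.shape
        seq = fc.flatten(2).permute(2, 0, 1)              # [H*W, B, D]
        seq, _ = self.lfsa(seq, seq, seq)                 # self-attention over positions
        attn = self.token_attn(seq.permute(1, 2, 0).reshape(B, D, H, W))
        attn = attn.flatten(2).softmax(-1)                # [B, L, H*W] spatial attention
        tokens = torch.einsum('bln,bnd->bld', attn, seq.permute(1, 0, 2))  # L visual tokens
        tokens = self.refine(tokens.permute(1, 0, 2), seq)    # refine against local features
        fg = self.reduce(tokens.permute(1, 0, 2).flatten(1))  # concat + dimension reduction
        return nn.functional.normalize(fg, dim=-1)        # global descriptor f_g

# Toy usage: an untrained stand-in backbone, just to check shapes
backbone = nn.Sequential(nn.Conv2d(3, 1024, 7, stride=16), nn.ReLU())
model = TokenSketch(backbone).eval()
with torch.no_grad():
    desc = model(torch.randn(2, 3, 224, 224))             # [2, 1024]
```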

About the code. Token is very simple to implement and experiment with. Training code follows this idea - it is not a library, but simply a train.py importing model and criterion definitions with standard training loops.

Requirements

  • Python 3
  • CUDA 11.0
  • PyTorch 1.8.0 (tested), torchvision 0.9.0
  • numpy
  • matplotlib

Usage - Representation learning

There are no extra compiled components in Token and package dependencies are minimal, so the code is very simple to use. We provide instructions on how to install dependencies via conda. Install PyTorch 1.8.0 and torchvision 0.9.0:

conda install pytorch=1.8.0 torchvision=0.9.0 cudatoolkit=11.0 -c pytorch

Data preparation

Before going further, please check out the Google Landmarks v2 GitHub repository. We use their training images. If you use this code in your research, please also cite their work!

Download and extract Google landmarkv2 train and val images with annotations from https://github.com/cvdfoundation/google-landmark.

Download the ROxf and RPar datasets with annotations. We expect the directory structure to be the following (a sketch of a matching dataset class follows the tree):

/data/
  ├─ Google-landmark-v2 # train images
  │   ├─ train.csv
  │   ├─ train_clean.csv
  │   ├─ GLDv2-clean-train-split.pkl
  │   ├─ GLDv2-clean-val-split.pkl
  │   └─ train
  └─ test # test images
      ├─ roxford5k
      │   ├─ jpg
      │   └─ gnd_roxford5k.pkl
      └─ rparis6k
          ├─ jpg
          └─ gnd_rparis6k.pkl
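
For reference, here is a minimal sketch of a dataset class reading the clean training list. It assumes GLDv2's published layout, where train_clean.csv maps each landmark_id to a space-separated list of image ids and images live under train/<c0>/<c1>/<c2>/<id>.jpg; the released code uses the .pkl split files instead, so treat this as illustrative:

```python
import csv
import os
from PIL import Image
from torch.utils.data import Dataset

class GLDv2Clean(Dataset):
    """Sketch: (image, landmark label) pairs from GLDv2's train_clean.csv."""
    def __init__(self, root='/data/Google-landmark-v2', transform=None):
        self.root, self.transform, self.samples = root, transform, []
        with open(os.path.join(root, 'train_clean.csv')) as f:
            for label, row in enumerate(csv.DictReader(f)):
                for img_id in row['images'].split():
                    self.samples.append((img_id, label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_id, label = self.samples[idx]
        # GLDv2 shards images by the first three characters of the id
        path = os.path.join(self.root, 'train',
                            img_id[0], img_id[1], img_id[2], img_id + '.jpg')
        img = Image.open(path).convert('RGB')
        return (self.transform(img) if self.transform else img), label
```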

Training

To train Token on a single node with 4 GPUs for 30 epochs, run:

sh experiment.sh
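
The script wraps a standard single-node distributed launch. A sketch of what such a launch looks like with PyTorch 1.8 (the train.py arguments are hypothetical placeholders):

```sh
# One process per GPU on a single node (PyTorch 1.8's launcher)
python -m torch.distributed.launch --nproc_per_node=4 \
    train.py --data-path /data --epochs 30 --lr 0.01
```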

A single epoch takes about 2.5 hours, so a 30-epoch run takes around 3 days on a single machine with four 3090Ti cards.

We train Token with SGD, setting the learning rate to 0.01. The refinement block is trained with a dropout of 0.1, and a linearly decaying scheduler gradually reduces the learning rate to 0 over the desired number of steps.
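
A sketch of that optimizer and schedule with stock PyTorch (the model and step counts below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)            # placeholder; the real model comes from the repo
steps_per_epoch = 1000              # illustrative; depends on dataset and batch size
total_steps = 30 * steps_per_epoch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Linear decay from the initial LR to 0 over total_steps
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / total_steps))

for step in range(total_steps):
    # ... forward pass, loss, and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()
```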

Evaluation

To evaluate on ROxf and RPar with a single GPU, run:

python test.py

and get results as below:

>> Test Dataset: roxford5k *** local aggregation >>
>> mAP Medium: 82.28, Hard: 66.57

>> Test Dataset: rparis6k *** local aggregation >>
>> mAP Medium: 89.34, Hard: 78.56

We found that performance varies slightly with the test environment. For example, on a GeForce RTX 2080Ti with CUDA 10.2, PyTorch 1.7.1, and torchvision 0.8.2, the test performance is:

>> Test Dataset: roxford5k *** local aggregation >>
>> mAP Medium: 81.36, Hard: 62.09

>> Test Dataset: rparis6k *** local aggregation >>
>> mAP Medium: 90.19, Hard: 80.16
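
For reference, here is a simplified sketch of the Medium/Hard protocol computed from the easy/hard/junk lists in the gnd_*.pkl annotations. test.py presumably relies on the official revisitop evaluation, which uses a slightly different AP interpolation, so treat this as illustrative:

```python
import numpy as np

def average_precision(ranks, positives, junk):
    """Plain AP for one query; `ranks` is the ranked list of database indices."""
    ranks = [r for r in ranks if r not in junk]   # junk images are ignored entirely
    hits, ap = 0, 0.0
    for i, r in enumerate(ranks):
        if r in positives:
            hits += 1
            ap += hits / (i + 1)                  # precision at this recall point
    return ap / max(len(positives), 1)

def map_medium_hard(ranks_per_query, gnd):
    """Medium: easy+hard count as positive; Hard: only hard, easy is junked."""
    medium, hard = [], []
    for ranks, g in zip(ranks_per_query, gnd):
        medium.append(average_precision(
            ranks, set(g['easy']) | set(g['hard']), set(g['junk'])))
        hard.append(average_precision(
            ranks, set(g['hard']), set(g['junk']) | set(g['easy'])))
    return 100 * np.mean(medium), 100 * np.mean(hard)
```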

Qualitative examples

Selected qualitative examples of different methods. The top-11 results are shown in the figure. Images with green bounding boxes are true positives; those with red bounding boxes are false positives.

[Figure: Token qualitative examples]