SBT

Primary language: Python. License: MIT.

Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, Wenjun Zeng

⭐ This is the official reproduced version of our CVPR 2022 work "Correlation-Aware Deep Tracking".

⭐ For an improved single-branch tracking model, SuperSBT, please see its separate GitHub repository.

Figure (compare): (a1) standard Siamese-like feature extraction; (a2) our target-dependent feature extraction; (b1) correlation step, such as Siamese cropping correlation [23], DCF [11], and Transformer-based correlation [5]; (b2) our pipeline removes the separate correlation step; (c) prediction stage; (d1)/(d2) are the t-SNE [38] visualizations of search features in (a1)/(a2) as the feature networks go deeper.

Figure (arch): (a) Architecture of our proposed Single Branch Transformer (SBT) for tracking. Unlike Siamese, DCF, and Transformer-based methods, it has no standalone module for computing correlation. Instead, it embeds correlation in all Cross-Attention layers, which exist at different levels of the network. The fully fused features of the search image are fed directly to the Classification Head (Cls Head) and Regression Head (Reg Head) to obtain localization and size embedding maps. (b) The structure of an Extract-or-Correlation (EoC) block. (c) The difference between EoC-SA and EoC-CA. PaE denotes patch embedding. LN denotes layer normalization.
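As a rough illustration of how correlation can live inside an attention layer, the sketch below implements plain single-head scaled dot-product cross-attention in pure Python: tokens from the search image act as queries, and tokens from the template act as keys and values. This is not the authors' EoC-CA block. Learned projections, multi-head splitting, layer normalization, and patch embedding are all omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search_tokens, template_tokens):
    """Single-head scaled dot-product cross-attention without learned weights.

    search_tokens:   list of d-dimensional vectors, used as queries.
    template_tokens: list of d-dimensional vectors, used as keys and values.
    Returns one fused d-dimensional vector per search token.
    """
    d = len(template_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in search_tokens:
        # Similarity of this search token to every template token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in template_tokens]
        weights = softmax(scores)
        # Weighted sum of the template tokens (the values).
        out = [sum(w * v[j] for w, v in zip(weights, template_tokens))
               for j in range(d)]
        fused.append(out)
    return fused


# Toy example (assumed data): two template tokens and two search tokens.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[4.0, 0.0], [0.0, 4.0]]
fused = cross_attention(search, template)
```

In this toy setting each search token is pulled toward the template token it matches best. Passing the same token list for both arguments gives ordinary self-attention, which loosely corresponds to the "extract" case.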

Abstract

Robustness and discrimination power are two fundamental requirements in visual object tracking. In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminatively model the tracked targets and distractor objects, hindering them from simultaneously meeting these two requirements. While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross-attention scheme. In contrast to the Siamese-like feature extraction, our network deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it is able to suppress non-target features, resulting in instance-varying feature extraction. The output features of the search image can be directly used for predicting target locations without an extra correlation step. Moreover, our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than existing methods. Extensive experiments show that our method achieves state-of-the-art results while running in real time. Our feature networks can also be applied seamlessly to existing tracking pipelines to raise tracking performance.

Model files and results

Models and raw results can be downloaded from Baidu NetDisk (password: ne0x):

[Models, raw results, and training logs (password: ne0x)]

Results

We obtain state-of-the-art results on several benchmarks while running at high speed (40 fps for SBT-base). More results are coming soon.

Model      GOT-10k AO (%)   GOT-10k SR0.5 (%)   GOT-10k SR0.75 (%)   Speed    Params
SBT-base   69.7             79.9                64.1                 40 fps   25.1M

Model      LaSOT AUC (%)   LaSOT Precision   LaSOT Norm. Precision   Speed    Params
SBT-base   68.0            73.9              77.8                    40 fps   25.1M
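For readers unfamiliar with the GOT-10k metrics reported above, the sketch below shows how they are conventionally defined: average overlap (AO) is the mean IoU between predicted and ground-truth boxes, and the success rate (SR) at a threshold is the fraction of frames whose IoU exceeds that threshold. The (x, y, w, h) box format, the toy boxes, and the function names are assumptions for illustration. Reported numbers should come from the official evaluation toolkit, not from this snippet.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(ious):
    """AO: mean IoU over all evaluated frames."""
    return sum(ious) / len(ious)


def success_rate(ious, threshold):
    """SR: fraction of frames whose IoU is strictly above the threshold."""
    return sum(1 for v in ious if v > threshold) / len(ious)


# Toy data (assumed): a perfect match, a half-shifted box, and a miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 10, 10), (0, 0, 10, 10)]
ious = [iou(p, g) for p, g in zip(preds, gts)]

ao = average_overlap(ious)
sr_50 = success_rate(ious, 0.5)
sr_75 = success_rate(ious, 0.75)
```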

Install dependencies

  • Docker image

    We also provide a docker image for reproducing our results:
    jaffe03/dualtfrpp:latest
    
  • Create and activate a conda environment

    conda create -n SBT python=3.7
    conda activate SBT
  • Install PyTorch

    conda install -c pytorch pytorch=1.6 torchvision=0.7.1 cudatoolkit=10.2
  • Install other packages

    conda install matplotlib pandas tqdm
    pip install opencv-python tb-nightly visdom scikit-image tikzplotlib gdown
    conda install cython scipy
    sudo apt-get install libturbojpeg
    pip install pycocotools jpeg4py
    pip install wget yacs
    pip install shapely==1.6.4.post2
    pip install mmcv timm
  • Set up the environment
    Create the default environment settings files.
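After installation, a quick sanity check can confirm that the main packages are importable. The sketch below is not part of the repository. The module list is an assumed subset of the packages installed above, written with their import names, which sometimes differ from the install names (for example, opencv-python is imported as cv2 and scikit-image as skimage).

```python
import importlib.util

# Import names for a subset of the packages installed above (assumed list).
REQUIRED_MODULES = [
    "torch", "torchvision", "matplotlib", "pandas", "tqdm", "cv2",
    "visdom", "skimage", "scipy", "pycocotools", "jpeg4py", "yacs",
    "shapely", "mmcv", "timm",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only looks modules up and does not import them, so it will not catch version mismatches or a broken CUDA setup.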

For training

  • Full dataset training (lasot, got10k, coco, trackingnet):
    python -m torch.distributed.launch --nproc_per_node 8 lib/train/run_training_sbt.py --script sbt --config sbt_base --save_dir ./
  • GOT-10k training (config sbt_base_got):
    python -m torch.distributed.launch --nproc_per_node 8 lib/train/run_training_sbt.py --script sbt --config sbt_base_got --save_dir ./

For testing

  • For example, on the LaSOT test set:
    python ./tracking/test.py --tracker_name sbt --tracker_param sbt_base --dataset lasot --threads 0
    python ./tracking/analysis_results_ITP.py --script sbt --config sbt_base
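If the two steps above are run often, they can be chained from a small Python wrapper. The sketch below is a hypothetical helper, not part of the repository. It uses only the script paths and flags shown above, and it assumes it is launched from the repository root with the SBT environment active.

```python
import subprocess
import sys


def build_commands(tracker_name="sbt", tracker_param="sbt_base",
                   dataset="lasot", threads=0):
    """Build the argv lists for the test run and the results analysis."""
    test_cmd = [
        sys.executable, "./tracking/test.py",
        "--tracker_name", tracker_name,
        "--tracker_param", tracker_param,
        "--dataset", dataset,
        "--threads", str(threads),
    ]
    analysis_cmd = [
        sys.executable, "./tracking/analysis_results_ITP.py",
        "--script", tracker_name,
        "--config", tracker_param,
    ]
    return [test_cmd, analysis_cmd]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands())` runs the evaluation and then the analysis. With `check=True`, the analysis is skipped if the test script exits with an error.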

Acknowledgement

This is a modified version of the Python framework PyTracking, which is based on PyTorch. It also borrows from PySOT, GOT-10k, and vision Transformers such as Swin Transformer, PVT, and Twins. We would like to thank their authors for providing great code and frameworks.

Contacts

Citing SBT

If you find SBT useful in your research, please consider citing:

@inproceedings{xie2022sbt,
  title={Correlation-aware deep tracking},
  author={Xie, Fei and Wang, Chunyu and Wang, Guangting and Cao, Yue and Yang, Wankou and Zeng, Wenjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}