MITS

Introduction

Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation. [arXiv]

Tracking any given object(s) spatially and temporally is the common goal of Visual Object Tracking (VOT) and Video Object Segmentation (VOS). We propose a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS.

  • First, a unified identification module is proposed to support both box and mask references for initialization: detailed object information is inferred from boxes or retained directly from masks (a toy illustration follows this list).
  • Additionally, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning.
  • All target objects are processed simultaneously from encoding to propagation and decoding, which enables our framework to handle complex scenes with multiple objects efficiently.
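
The box-reference idea can be pictured with a toy example: a first-frame box is rasterized into a coarse binary mask, which the identification module then refines into a detailed object representation. The sketch below shows only this naive rasterization step, not the learned module; the function name and shapes are illustrative.

    import numpy as np

    def box_to_mask(box, height, width):
        # Rasterize an (x, y, w, h) box into a coarse binary mask.
        # MITS's learned identification module infers detailed object
        # information from such box references; this is only the naive
        # box-to-mask step, shown for intuition.
        x, y, w, h = box
        mask = np.zeros((height, width), dtype=np.uint8)
        x0, y0 = max(int(round(x)), 0), max(int(round(y)), 0)
        x1 = min(int(round(x + w)), width)
        y1 = min(int(round(y + h)), height)
        mask[y0:y1, x0:x1] = 1
        return mask

    # An 80x50 box at (10, 20) in a 480x854 frame covers 4000 pixels.
    print(box_to_mask((10, 20, 80, 50), 480, 854).sum())  # 4000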

Requirements

  • Python3
  • pytorch >= 1.7.0 and torchvision
  • opencv-python
  • Pillow
  • Pytorch Correlation. We recommend installing from source instead of using pip, then verifying the build as shown below:
    git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
    cd Pytorch-Correlation-extension
    python setup.py install
    cd -
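
After installation, a quick check like the following should confirm the extension built correctly. The package installs as spatial_correlation_sampler; the tensor shapes below are just an example.

    import torch
    from spatial_correlation_sampler import SpatialCorrelationSampler

    # Correlate two small feature maps; the output has shape
    # (batch, patch_size, patch_size, H, W).
    sampler = SpatialCorrelationSampler(kernel_size=1, patch_size=3,
                                        stride=1, padding=0, dilation=1)
    a = torch.randn(1, 8, 16, 16)
    b = torch.randn(1, 8, 16, 16)
    print(sampler(a, b).shape)  # torch.Size([1, 3, 3, 16, 16])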

Getting Started

Training

Data Preparation

  • VOS datasets: YouTube-VOS 2019 train, DAVIS 2017 train
  • VOT datasets: LaSOT train, GOT-10K train

Download the datasets and re-organize the folders into the following structure:

datasets
└───YTB
    └───2019
        └───train
            └───JPEGImages
            └───Annotations
└───DAVIS
    └───JPEGImages
    └───Annotations
    └───ImageSets
└───LaSOT
    └───JPEGImages
    └───Annotations
    └───BoxAnnotations
└───GOT10K
    └───JPEGImages
    └───Annotations
    └───BoxAnnotations
  • JPEGImages and Annotations include per-sequence subfolders containing frame images and masks, respectively. BoxAnnotations contains a txt annotation file for every video sequence.
  • The standard training of MITS uses the same training data as the prior work RTS. The download link for the pseudo-masks of LaSOT and GOT10K can be found here; they populate the Annotations folders of LaSOT and GOT10K.
  • For the LaSOT training set, training frames are sampled every 5 frames from the original sequences, matching the pseudo-masks provided by RTS. This may require sampling frame images and box annotations from the original LaSOT data source (see the sketch after the note below).

Note: Although the pseudo-masks are used for training by default, MITS can also be trained with mixed box/mask annotations, without pseudo-masks, thanks to its compatibility with both annotation types.
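
A hypothetical sketch of the every-5-frames sampling is shown below; all paths and the helper name are assumptions, so adapt them to the layout above. It copies every 5th frame image and keeps the matching line of the box annotation file so the kept frames align with the pseudo-masks.

    import os
    import shutil

    def subsample_sequence(src_img_dir, src_box_file,
                           dst_img_dir, dst_box_file, step=5):
        # Keep every `step`-th frame and its box annotation line.
        os.makedirs(dst_img_dir, exist_ok=True)
        frames = sorted(os.listdir(src_img_dir))
        with open(src_box_file) as f:
            boxes = f.read().splitlines()
        kept = []
        for i in range(0, min(len(frames), len(boxes)), step):
            shutil.copy(os.path.join(src_img_dir, frames[i]), dst_img_dir)
            kept.append(boxes[i])
        with open(dst_box_file, "w") as f:
            f.write("\n".join(kept) + "\n")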

Pretrain Weights

MITS is initialized with pretrained DeAOT weights. Download R50_DeAOTL_PRE.pth and put it in the pretrain_models folder.
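
A minimal sanity check that the checkpoint downloaded intact might look like the following; it only assumes the file is a standard PyTorch checkpoint dict, which may differ from how the training scripts actually consume it.

    import torch

    # Load on CPU and list a few keys to verify the file is readable.
    ckpt = torch.load("pretrain_models/R50_DeAOTL_PRE.pth", map_location="cpu")
    if isinstance(ckpt, dict):
        print(list(ckpt.keys())[:5])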

Train Models

Run train.sh to launch training. Configs for different training settings are in the configs folder. See the model zoo for details.

Evaluation

Data Preparation

  • VOS datasets: YouTube-VOS 2019 valid, DAVIS 2017 valid
  • VOT datasets: LaSOT test, TrackingNet test, GOT-10K test

We follow the original file structure of each dataset. The first-frame mask for YouTube-VOS/DAVIS and the first-frame box for LaSOT/TrackingNet/GOT-10K are required for evaluation (a parsing sketch follows the layout below).

datasets
└───YTB
    └───2019
        └───valid
            └───JPEGImages
            └───Annotations
└───DAVIS
    └───JPEGImages
    └───Annotations
    └───ImageSets
└───LaSOTTest
    └───airplane-1
        └───img
        └───groundtruth.txt
└───TrackingNetTest
    └───JPEGImages
        └───__WaG8fRMto_0
    └───BoxAnnotations
        └───__WaG8fRMto_0.txt
└───GOT10KTest
    └───GOT-10k_Test_000001
        └───00000001.jpg
        ...
        └───groundtruth.txt
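
As a sketch of reading the first-frame box (LaSOT, TrackingNet, and GOT-10K all store one x,y,w,h box per line in their annotation txt files), something like the following could be used; the helper name is illustrative.

    def read_first_box(gt_path):
        # Parse the first line of a groundtruth/box annotation file.
        # Handles comma- or whitespace-separated "x y w h" values.
        with open(gt_path) as f:
            line = f.readline().strip()
        x, y, w, h = map(float, line.replace(",", " ").split()[:4])
        return x, y, w, h

    print(read_first_box("datasets/LaSOTTest/airplane-1/groundtruth.txt"))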

Evaluate Models

Run eval_vos.sh to evaluate on YouTube-VOS or DAVIS, and eval_vot.sh to evaluate on LaSOT, TrackingNet or GOT10K.

The outputs include predicted masks from the mask head, bounding boxes derived from the masks (bbox), and predicted boxes from the box head (boxh). By default, masks are used for VOS benchmarks and boxh boxes for VOT benchmarks.
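
The bbox outputs are boxes derived from predicted masks; a common way to do this conversion (an illustration, not necessarily the repository's exact code) is to take the tight bounding rectangle of the foreground pixels:

    import numpy as np

    def mask_to_box(mask):
        # Tight (x, y, w, h) box around the nonzero pixels of a binary mask.
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return None  # empty mask: no box
        x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
        return float(x0), float(y0), float(x1 - x0 + 1), float(y1 - y0 + 1)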

Model Zoo

Model Download

Model      Training Data   File
MITS       full            gdrive
MITS_box   no VOT masks    gdrive
MITS_got   only GOT10k     gdrive

Evaluation Results

Model      LaSOT Test       TrackingNet Test   GOT10k Test        YouTube-VOS 19 val   DAVIS 17 val
           AUC/PN/P         AUC/PN/P           AO/SR0.5/SR0.75    G                    G
MITS       72.1/80.1/78.6   83.5/88.7/84.5     78.5/87.5/73.7     85.9                 84.9
MITS_box   70.7/78.1/75.8   83.0/87.8/83.1     78.0/86.4/71.7     85.7                 84.3
MITS_got   -                -                  80.4/89.7/75.9     -                    -

Prediction files for each result are available from the corresponding gdrive links.

By default, we use box predictions for VOT benchmarks and mask predictions for VOS benchmarks. Results may differ from those reported in the paper by about 0.1 due to code updates.

Acknowledgement

The implementation is heavily based on prior VOS work AOT/DeAOT.

The pseudo-masks of LaSOT and GOT10K used for training are taken from RTS.

Citing

@article{xu2023integrating,
  title={Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation},
  author={Xu, Yuanyou and Yang, Zongxin and Yang, Yi},
  journal={arXiv preprint arXiv:2308.13266},
  year={2023}
}
@inproceedings{yang2022deaot,
  title={Decoupling Features in Hierarchical Propagation for Video Object Segmentation},
  author={Yang, Zongxin and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
@article{yang2021aost,
  title={Scalable Video Object Segmentation with Identification Mechanism},
  author={Yang, Zongxin and Wang, Xiaohan and Miao, Jiaxu and Wei, Yunchao and Wang, Wenguan and Yang, Yi},
  journal={arXiv preprint arXiv:2203.11442},
  year={2023}
}
@inproceedings{yang2021aot,
  title={Associating Objects with Transformers for Video Object Segmentation},
  author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}