Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation. [Arxiv]
Tracking any given object(s) spatially and temporally is a goal shared by Visual Object Tracking (VOT) and Video Object Segmentation (VOS). We propose a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS.
- Firstly, a unified identification module is proposed to support both box and mask references for initialization, where detailed object information is inferred from boxes or directly retained from masks.
- Additionally, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning.
- All target objects are processed simultaneously from encoding to propagation and decoding, which enables our framework to handle complex scenes with multiple objects efficiently.
- Python3
- pytorch >= 1.7.0 and torchvision
- opencv-python
- Pillow
- Pytorch Correlation. Installing from source is recommended over using pip:
    git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
    cd Pytorch-Correlation-extension
    python setup.py install
    cd -
- VOS datasets: YouTube-VOS 2019 train, DAVIS 2017 train
- VOT datasets: LaSOT train, GOT-10K train
Download the datasets and reorganize the folders into the following structure:
datasets
└───YTB
    └───2019
        └───train
            └───JPEGImages
            └───Annotations
└───DAVIS
    └───JPEGImages
    └───Annotations
    └───ImageSets
└───LaSOT
    └───JPEGImages
    └───Annotations
    └───BoxAnnotations
└───GOT10K
    └───JPEGImages
    └───Annotations
    └───BoxAnnotations
- JPEGImages and Annotations include subfolders for video sequences containing images and masks. BoxAnnotations contains txt annotation files for every video sequence.
- The standard training of MITS uses the same training data as the prior work RTS. The download link for pseudo-masks for LaSOT and GOT10k can be found here. The corresponding folders are the Annotations folders of LaSOT and GOT10K.
- For the LaSOT training set, training frames are sampled every 5 frames from the original sequences, according to the pseudo-masks provided by RTS. This may require sampling the frame images and box annotations from the original LaSOT data source.
Note: Although the pseudo masks are used for training by default, MITS can also be trained with mixed annotations without pseudo masks due to its strong compatibility.
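The LaSOT subsampling described above can be sketched as follows. This is a minimal illustration, not the repository's own preprocessing: the function name `sample_by_masks`, the `.png` mask extension, and the toy file names are assumptions, as is the idea that a frame and its pseudo-mask share the same file stem.

```python
from pathlib import Path


def sample_by_masks(frame_names, box_lines, mask_names):
    """Keep only the frames (and their box lines) that have a pseudo-mask.

    frame_names: frame file names of one sequence, in temporal order.
    box_lines:   one box annotation line per frame, in the same order.
    mask_names:  pseudo-mask file names available for this sequence.
    """
    if len(frame_names) != len(box_lines):
        raise ValueError("expected exactly one box line per frame")
    # Assumption: a frame and its mask share the same stem (e.g. 00000006).
    mask_stems = {Path(name).stem for name in mask_names}
    kept = [
        (frame, box)
        for frame, box in zip(frame_names, box_lines)
        if Path(frame).stem in mask_stems
    ]
    return [frame for frame, _ in kept], [box for _, box in kept]


# Toy sequence of 12 frames with masks on every 5th frame (hypothetical data).
frames = [f"{i:08d}.jpg" for i in range(1, 13)]
boxes = [f"{i},{i},10,10" for i in range(1, 13)]
masks = ["00000001.png", "00000006.png", "00000011.png"]

kept_frames, kept_boxes = sample_by_masks(frames, boxes, masks)
print(kept_frames)
```

Selecting frames by the masks that actually exist, rather than by a fixed stride, avoids having to guess which frame index the sampling starts from.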
MITS is initialized with pretrained DeAOT. Download R50_DeAOTL_PRE.pth and put it in the pretrain_models folder.
Run train.sh to launch training. Configs for the different training settings are in the configs folder. See the model zoo for details.
- VOS datasets: YouTube-VOS 2019 valid, DAVIS 2017 valid
- VOT datasets: LaSOT test, TrackingNet test, GOT-10K test
We follow the original file structure of each dataset. The first-frame mask for YouTube-VOS/DAVIS and the first-frame box for LaSOT/TrackingNet/GOT-10K are required for evaluation.
datasets
└───YTB
    └───2019
        └───valid
            └───JPEGImages
            └───Annotations
└───DAVIS
    └───JPEGImages
    └───Annotations
    └───ImageSets
└───LaSOTTest
    └───airplane-1
        └───img
        └───groundtruth.txt
└───TrackingNetTest
    └───JPEGImages
        └───__WaG8fRMto_0
    └───BoxAnnotations
        └───__WaG8fRMto_0.txt
└───GOT10KTest
    └───GOT-10k_Test_000001
        └───00000001.jpg
        ...
        └───groundtruth.txt
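As an illustration of reading the first-frame box that VOT evaluation needs, the sketch below parses the first non-empty line of a groundtruth-style annotation file. It is not taken from the MITS code: the helper name `parse_first_box` is hypothetical, and it assumes each line holds four numbers separated by commas or whitespace (interpreted here as x, y, width, height).

```python
import re


def parse_first_box(groundtruth_text):
    """Return the first annotated box in the text as a tuple of four floats."""
    for line in groundtruth_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines before the first annotation
        # Assumption: values are separated by commas and/or whitespace.
        values = [float(v) for v in re.split(r"[,\s]+", line)]
        if len(values) != 4:
            raise ValueError(f"expected 4 values per line, got {len(values)}")
        return tuple(values)
    raise ValueError("no box annotation found")


# Hypothetical annotation content; real files come from the datasets.
example = "367,101,41,16\n366,103,45,16\n"
first_box = parse_first_box(example)
print(first_box)
```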
Run eval_vos.sh to evaluate on YouTube-VOS or DAVIS, and eval_vot.sh to evaluate on LaSOT, TrackingNet or GOT10K.
The outputs include predicted masks from the mask head, bounding boxes derived from the masks (bbox), and predicted boxes from the box head (boxh). By default, masks are used for VOS benchmarks and boxh boxes are used for VOT benchmarks.
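To make the bbox output concrete, here is one common way to derive a tight axis-aligned box from a binary mask. It is a sketch under assumptions rather than the repository's implementation: the function name `mask_to_bbox` and the (x, y, width, height) output convention are chosen for illustration, and NumPy is assumed to be available (it is installed alongside opencv-python).

```python
import numpy as np


def mask_to_bbox(mask):
    """Return the tight (x, y, w, h) box around nonzero pixels, or None if empty."""
    ys, xs = np.nonzero(mask)  # row indices (y) and column indices (x)
    if ys.size == 0:
        return None  # the object is absent from this frame
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    # +1 because both the min and max pixel belong to the object.
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1)


# Toy 5x6 mask with a 2-row by 3-column object (hypothetical data).
toy_mask = np.zeros((5, 6), dtype=np.uint8)
toy_mask[1:3, 2:5] = 1
toy_box = mask_to_bbox(toy_mask)
print(toy_box)
```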
Model | Training Data | File |
---|---|---|
MITS | full | gdrive |
MITS_box | no VOT masks | gdrive |
MITS_got | only GOT10k | gdrive |
Model | LaSOT Test AUC/PN/P | TrackingNet Test AUC/PN/P | GOT10k Test AO/SR0.5/SR0.75 | YouTube-VOS 19 val G | DAVIS 17 val G |
---|---|---|---|---|---|
MITS (prediction file) | 72.1/80.1/78.6 gdrive | 83.5/88.7/84.5 gdrive | 78.5/87.5/73.7 gdrive | 85.9 gdrive | 84.9 gdrive |
MITS_box (prediction file) | 70.7/78.1/75.8 gdrive | 83.0/87.8/83.1 gdrive | 78.0/86.4/71.7 gdrive | 85.7 gdrive | 84.3 gdrive |
MITS_got (prediction file) | - | - | 80.4/89.7/75.9 gdrive | - | - |
By default, we use box predictions for VOT benchmarks and mask predictions for VOS benchmarks. Results may differ by about 0.1 from those reported in the paper due to code updates.
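For readers unfamiliar with the GOT10k columns, the sketch below shows how average overlap (AO) and success rate (SR) at a threshold are commonly computed from per-frame boxes. It is a simplified illustration and not the official evaluation: boxes are assumed to be (x, y, w, h), the function names are hypothetical, and details of the official protocol (such as which frames are excluded) are not reproduced.

```python
def iou_xywh(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def ao_and_sr(pred_boxes, gt_boxes, thresholds=(0.5, 0.75)):
    """Return mean IoU (AO) and, per threshold, the fraction of frames above it (SR)."""
    ious = [iou_xywh(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    ao = sum(ious) / len(ious)
    sr = {t: sum(iou > t for iou in ious) / len(ious) for t in thresholds}
    return ao, sr


# Two toy frames: a perfect match and a half-shifted box (hypothetical data).
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 10, 10)]
ao, sr = ao_and_sr(preds, gts)
print(ao, sr)
```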
The implementation is heavily based on prior VOS work AOT/DeAOT.
Pseudo-masks for LaSOT and GOT10K for training are taken from RTS.
@article{xu2023integrating,
title={Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation},
author={Xu, Yuanyou and Yang, Zongxin and Yang, Yi},
journal={arXiv preprint arXiv:2308.13266},
year={2023}
}
@inproceedings{yang2022deaot,
title={Decoupling Features in Hierarchical Propagation for Video Object Segmentation},
author={Yang, Zongxin and Yang, Yi},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2022}
}
@article{yang2021aost,
title={Scalable Video Object Segmentation with Identification Mechanism},
author={Yang, Zongxin and Wang, Xiaohan and Miao, Jiaxu and Wei, Yunchao and Wang, Wenguan and Yang, Yi},
journal={arXiv preprint arXiv:2203.11442},
year={2023}
}
@inproceedings{yang2021aot,
title={Associating Objects with Transformers for Video Object Segmentation},
author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2021}
}