
[TIM 2024] Multi-granularity Localization Transformer with Collaborative Understanding for Referring Multi-Object Tracking


MGLT

The official PyTorch implementation of the paper "Multi-granularity Localization Transformer with Collaborative Understanding for Referring Multi-Object Tracking".

Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video frames, using linguistic prompts as references. To make linguistic prompts more effective during training, we introduce a novel Multi-Granularity Localization Transformer with collaborative understanding, termed MGLT. Unlike previous methods, which focus on visual-language fusion and post-processing, MGLT revisits RMOT by addressing two problems: the attenuation of linguistic cues during propagation and poor collaborative localization ability. MGLT comprises two key components: Multi-Granularity Implicit Query Bootstrapping (MGIQB) and Multi-Granularity Track-Prompt Alignment (MGTPA). MGIQB ensures that tracking and linguistic features are preserved in later layers of network propagation by bootstrapping the model to generate text-relevant and temporally enhanced track queries. Simultaneously, MGTPA with multi-granularity linguistic prompts enhances the model's localization ability by understanding the relative positions of different referred objects within a frame. Extensive experiments on well-recognized benchmarks demonstrate that MGLT achieves state-of-the-art performance. Notably, it shows significant improvements on the Refer-KITTI dataset of 2.73%, 7.95% and 3.18% in HOTA, AssA, and IDF1, respectively.

Framework

Preparation

Prepare the data for Refer-KITTI and Refer-BDD.

Before training, please download the pretrained weights from Deformable DETR and CLIP-R50.

Then organize the project as follows:

├── refer-kitti
│   ├── KITTI
│           ├── training
│           ├── labels_with_ids
│   └── expression
├── refer-bdd
│   ├── BDD
│           ├── training
│           ├── labels_with_ids
│           ├── expression
├── weights
│   ├── r50_deformable_detr_plus_iterative_bbox_refinement-checkpoint.pth
│   ├── RN50.pt
...
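
Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the MGLT codebase: the helper name `missing_paths` and the `EXPECTED` list are hypothetical, the paths are copied from the tree above (reading `expression` as a sibling of `KITTI` under `refer-kitti` and as a child of `BDD` under `refer-bdd`, as the tree is drawn), and the elided `...` entry is deliberately not checked.

```python
from pathlib import Path

# Entries copied from the directory tree in this README.
# Assumption: nesting is read literally from the tree as drawn.
EXPECTED = [
    "refer-kitti/KITTI/training",
    "refer-kitti/KITTI/labels_with_ids",
    "refer-kitti/expression",
    "refer-bdd/BDD/training",
    "refer-bdd/BDD/labels_with_ids",
    "refer-bdd/BDD/expression",
    "weights/r50_deformable_detr_plus_iterative_bbox_refinement-checkpoint.pth",
    "weights/RN50.pt",
]


def missing_paths(root="."):
    """Return the expected entries that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing entries:")
        for p in missing:
            print("  " + p)
    else:
        print("All expected data and weight paths were found.")
```

Run it from the project root; pass a different directory to `missing_paths` if the data is stored elsewhere.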

Training

To train MGLT with 4 GPUs, run:

sh configs/r50_rmot_train.sh

Testing

To evaluate MGLT with 1 GPU, run:

sh configs/r50_rmot_test.sh

Result

The main results of MGLT:

Method Dataset HOTA DetA AssA DetRe DetPr AssRe AssPr LocA MOTA IDF1 IDS URL
MGLT Refer-KITTI 49.25 37.09 65.50 49.28 58.72 69.88 88.23 91.10 21.13 55.91 2442 model
MGLT Refer-BDD 40.26 28.44 57.59 37.24 52.48 63.52 81.87 86.98 11.68 44.41 12935 model

License

This project is under the MIT license. See LICENSE for details.

Update

  • 2024.5.8 Release code and checkpoint.

  • 2024.3.25 Init repository.

Acknowledgements

Our project is based on RMOT and CO-MOT. Many thanks to the authors of these excellent works.