ChatterBox

ChatterBox: Multi-round Multimodal Referring and Grounding

Yunjie Tian*¹, Tianren Ma*¹, Lingxi Xie², Jihao Qiu¹, Xi Tang¹, Yuan Zhang¹, Jianbin Jiao¹, Qi Tian², Qixiang Ye¹

¹ University of Chinese Academy of Sciences, ² HUAWEI Inc.

Abstract

In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions.

Overview

The architecture of the ChatterBox model.

Key Contributions:

CB-300K - We establish the CB-300K benchmark to facilitate research on multi-round referring and grounding.
ChatterBox Model - We build the ChatterBox model with a dual-branch architecture to address the multi-round referring and grounding problem.

We clarify that multi-round referring and grounding (MRG) is distinct from multiple independent rounds of single-round referring and grounding. The logical coherence between dialogue rounds (see CB-LC) is crucial for an interactive chat agent, and we are the first to address this problem.
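To make the distinction concrete, here is a minimal sketch that contrasts a prompt built from one round alone with a prompt that carries the earlier rounds. The `Round` class, the `USER:`/`ASSISTANT:` prefixes, and the example dialogue are all hypothetical and chosen only for illustration; they are not the CB-300K format or the prompt template used by ChatterBox.

```python
from dataclasses import dataclass


@dataclass
class Round:
    """One question/answer pair (hypothetical structure, not the CB-300K schema)."""
    question: str
    answer: str


# Made-up dialogue: the second question only makes sense given the first round.
EXAMPLE = [
    Round("Where is the dog?", "The dog is on the sofa."),
    Round("What is next to it?", "A cushion."),
]


def single_round_prompt(rounds, i):
    """Treat round i in isolation: earlier rounds are dropped."""
    return rounds[i].question


def multi_round_prompt(rounds, i):
    """Keep every earlier round so that references such as 'it' stay resolvable."""
    lines = []
    for r in rounds[:i]:
        lines.append(f"USER: {r.question}")
        lines.append(f"ASSISTANT: {r.answer}")
    lines.append(f"USER: {rounds[i].question}")
    return "\n".join(lines)
```

With `EXAMPLE`, `single_round_prompt(EXAMPLE, 1)` yields only "What is next to it?", which leaves "it" undefined, whereas `multi_round_prompt(EXAMPLE, 1)` includes the earlier exchange about the dog.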

Updates

Jan. 24th, 2024: The paper, code, and dataset are released.

Install

Clone this repository and navigate to the ChatterBox folder

git clone https://github.com/sunsmarterjie/ChatterBox
cd ChatterBox

Install Packages

conda create -n chatterbox python=3.11.5 
conda activate chatterbox
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install deepspeed==0.11.1
unzip mmcv-1.4.7.zip
cd mmcv-1.4.7/
MMCV_WITH_OPS=1 pip install -e .
cd ../model/GroundingDINO/ops
python setup.py build install
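After the steps above, a quick version check can catch a mismatched environment before training starts. The following is a minimal sketch, not part of the repository: `check_versions` is a hypothetical helper, and the distribution name `mmcv-full` is an assumption about how a source build with `MMCV_WITH_OPS=1` registers itself (adjust it if your build reports `mmcv` instead).

```python
from importlib import metadata

# Versions pinned in the install steps above.
# Assumption: a source build with MMCV_WITH_OPS=1 is registered as "mmcv-full".
EXPECTED = {"deepspeed": "0.11.1", "mmcv-full": "1.4.7"}


def check_versions(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatch; found is None if missing."""
    problems = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != want:
            problems[name] = (want, found)
    return problems


# Usage (run inside the activated "chatterbox" environment):
#   for name, (want, found) in check_versions(EXPECTED).items():
#       print(f"{name}: expected {want}, found {found or 'not installed'}")
```

An empty result means both pinned packages were found at the expected versions; packages installed through requirements.txt are not covered by this check.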

Train

We build the visual branch of ChatterBox using GroundingDINO and DINO. Only the GroundingDINO version is provided for now.

Prepare datasets/models:

Download CB-300K, VG, COCO2017, COCO2014, RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, OpenSource, clip-vit-large-patch14, LLaVA-Instruct-150K, llava-llama-2-13b, CB-materials, groundingdino_swinb.

├── datasets
│   ├── CB-300K
│   │   ├── CB-MRG
│   │   ├── CB-LC
│   │   └── ...
│   ├── VG
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   │   └── ...
│   ├── MSCOCO2017
│   │   ├── train2017
│   │   └── ...
│   ├── MSCOCO2014
│   │   ├── train2014
│   │   └── ...
│   ├── Flickr30K
│   │   ├── flickr30k-images
│   │   └── ...
│   ├── llava_instruct_150k.json
│   └── CB_materials
│       ├── CB-refcoco-GND
│       ├── CB-coco-GND
│       ├── CB-refcoco-REF
│       └── ...
├── clip-vit-large-patch14
│   ├── config.json
│   └── ...
├── llava-llama-2-13b-chat-lightning-preview
│   ├── config.json
│   └── ...
├── OpenSource
│   ├── finetune_refcoco_train.json
│   ├── finetune_refcoco+_train.json
│   └── ...
└── groundingdino_swinb_cogcoor.pth
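Before launching training, it can help to confirm that the downloads match the layout above. The sketch below is a hypothetical helper, not part of the repository: it only checks the entries that the tree names explicitly (anything hidden behind "..." is ignored), and it assumes that `datasets`, the model folders, and the GroundingDINO checkpoint all sit under one root directory that you pass in.

```python
from pathlib import Path

# Paths copied from the tree above, relative to a root directory of your choice.
EXPECTED_PATHS = [
    "datasets/CB-300K/CB-MRG",
    "datasets/CB-300K/CB-LC",
    "datasets/VG/VG_100K",
    "datasets/VG/VG_100K_2",
    "datasets/MSCOCO2017/train2017",
    "datasets/MSCOCO2014/train2014",
    "datasets/Flickr30K/flickr30k-images",
    "datasets/llava_instruct_150k.json",
    "datasets/CB_materials/CB-refcoco-GND",
    "datasets/CB_materials/CB-coco-GND",
    "datasets/CB_materials/CB-refcoco-REF",
    "clip-vit-large-patch14/config.json",
    "llava-llama-2-13b-chat-lightning-preview/config.json",
    "OpenSource/finetune_refcoco_train.json",
    "OpenSource/finetune_refcoco+_train.json",
    "groundingdino_swinb_cogcoor.pth",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected entries (files or folders) that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


# Usage: print every entry that is still missing under the current directory.
#   for p in missing_paths("."):
#       print("missing:", p)
```

The check only tests for existence, so a partially downloaded or corrupted archive would still pass.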

Train ChatterBox on 8xA800 GPUs (80GB).

python startup_stage1.py  # stage1
python startup_stage2.py  # stage2
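Since stage 2 follows stage 1, the two commands above can be chained so that stage 2 does not start if stage 1 exits with an error. This is a minimal sketch using only the two script names shown above; `run_stages` is a hypothetical wrapper, not something the repository provides, and it assumes both scripts are run from the repository root with no extra arguments.

```python
import subprocess
import sys

# Script names taken from the commands above.
STAGES = ["startup_stage1.py", "startup_stage2.py"]


def run_stages(scripts, runner=subprocess.run):
    """Run each script in order with the current interpreter.

    Stops at the first non-zero exit code and returns that script's name;
    returns None if every stage finished successfully.
    """
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            return script
    return None


# Usage:
#   failed = run_stages(STAGES)
#   if failed:
#       sys.exit(f"{failed} failed; later stages were not started")
```

The `runner` parameter exists only so the control flow can be exercised without launching a real training job.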

Evaluation

See details at evaluation.

Demo

Coming soon

Citation

If this project has been helpful or if you've used our dataset, please cite:

@article{tian2024chatterbox,
  title={ChatterBox: Multi-round Multimodal Referring and Grounding},
  author={Tian, Yunjie and Ma, Tianren and Xie, Lingxi and Qiu, Jihao and Tang, Xi and Zhang, Yuan and Jiao, Jianbin and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2401.13307},
  year={2024}
}

Acknowledgment

This project is based on LLaVA (paper, code), LISA (paper, code), and GPT4RoI (paper, code). We thank the authors for their excellent work.
