
ChatterBox: Multi-round Multimodal Referring and Grounding, Multimodal, Multi-round dialogues

Primary language: Python. License: Apache-2.0.

ChatterBox

ChatterBox: Multi-round Multimodal Referring and Grounding

Yunjie Tian*1, Tianren Ma*1, Lingxi Xie2, Jihao Qiu1, Xi Tang1, Yuan Zhang1, Jianbin Jiao1, Qi Tian2, Qixiang Ye1

1 University of Chinese Academy of Sciences, 2 HUAWEI Inc.

Paper: (arXiv 2401.13307)

Abstract

In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions.

Overview


The architecture of the ChatterBox model.

Key Contributions:

  • CB-300K - We establish the CB-300K benchmark to facilitate research on multi-round referring and grounding.
  • ChatterBox Model - We build the ChatterBox model on a dual-branch architecture to solve the multi-round referring and grounding problem.

We clarify that our multi-round referring and grounding (MRG) task is distinct from multiple independent rounds of single-round referring and grounding. Logical coherence across the rounds of a dialogue (see CB-LC) is crucial for an interactive chat agent, and we are the first to address this problem.
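To make the distinction concrete, the minimal sketch below builds the text context a model would see for a given round, with and without the dialogue history. The `Round` class, the `build_context` helper, the prompt layout, and the example dialogue are all hypothetical and chosen for illustration only; they are not the CB-300K schema or the ChatterBox prompt format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Round:
    """One question/answer exchange (hypothetical structure, not the CB-300K format)."""
    question: str
    answer: str


def build_context(dialogue: List[Round], index: int, multi_round: bool = True) -> str:
    """Return the text available when answering round `index`.

    With multi_round=True, all earlier rounds are included, so pronouns such
    as "it" can be resolved. With multi_round=False, each round is treated as
    an independent single-round query and only the current question is kept.
    """
    history = dialogue[:index] if multi_round else []
    lines = []
    for r in history:
        lines.append(f"User: {r.question}")
        lines.append(f"Assistant: {r.answer}")
    lines.append(f"User: {dialogue[index].question}")
    return "\n".join(lines)


# Illustrative dialogue: the second question depends on the first answer.
dialogue = [
    Round("Where is the dog?", "The dog is at [40, 60, 200, 220]."),
    Round("What is it holding in its mouth?", "A red frisbee."),
]

multi = build_context(dialogue, 1)                      # history included
single = build_context(dialogue, 1, multi_round=False)  # history dropped
```

In the single-round context the word "it" has no antecedent, which is the situation that multi-round coherence is meant to handle.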

Updates

  • Jan. 24th, 2024: The paper, code, and dataset are released.


Install

  1. Clone this repository and navigate to the ChatterBox folder
git clone https://github.com/sunsmarterjie/ChatterBox
cd ChatterBox
  2. Install packages
conda create -n chatterbox python=3.11.5 
conda activate chatterbox
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install deepspeed==0.11.1
unzip mmcv-1.4.7.zip
cd mmcv-1.4.7/
MMCV_WITH_OPS=1 pip install -e .
cd ../model/GroundingDINO/ops
python setup.py build install
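After installation, it can be useful to confirm that the pinned versions from the steps above are the ones actually present in the environment. The following is a minimal sketch using only the standard library. The helper name `check_versions` is hypothetical, and the distribution names in `EXPECTED` are assumptions (in particular, we assume that building mmcv 1.4.7 with `MMCV_WITH_OPS=1` registers it as `mmcv-full`); adjust them if `pip list` shows different names.

```python
from importlib import metadata
from typing import Callable, Dict

# Version pins taken from the install steps; distribution names are assumptions.
EXPECTED: Dict[str, str] = {
    "deepspeed": "0.11.1",
    "mmcv-full": "1.4.7",
}


def check_versions(
    expected: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Return a {package: problem} mapping; an empty dict means all pins match."""
    problems: Dict[str, str] = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


# Usage inside the activated conda environment:
#     for name, issue in check_versions(EXPECTED).items():
#         print(f"{name}: {issue}")
```

The `get_version` parameter exists only so the lookup can be replaced in tests; in normal use the default `importlib.metadata.version` is sufficient.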

Train

We build the visual branch of ChatterBox using GroundingDINO and DINO. Only the GroundingDINO version is provided for now.

  • Prepare datasets/models:

Download CB-300K, VG, COCO2017, COCO2014, RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, OpenSource, clip-vit-large-patch14, LLaVA-Instruct-150K, llava-llama-2-13b, CB-materials, groundingdino_swinb.

├── datasets
│   ├── CB-300K
│   │   ├── CB-MRG
│   │   ├── CB-LC
│   │   └── ...
│   ├── VG
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   │   └── ...
│   ├── MSCOCO2017
│   │   ├── train2017
│   │   └── ...
│   ├── MSCOCO2014
│   │   ├── train2014
│   │   └── ...
│   ├── Flickr30K
│   │   ├── flickr30k-images
│   │   └── ...
│   ├── llava_instruct_150k.json
│   └── CB_materials
│       ├── CB-refcoco-GND
│       ├── CB-coco-GND
│       ├── CB-refcoco-REF
│       └── ...
├── clip-vit-large-patch14
│   ├── config.json
│   └── ...
├── llava-llama-2-13b-chat-lightning-preview
│   ├── config.json
│   └── ...
├── OpenSource
│   ├── finetune_refcoco_train.json
│   ├── finetune_refcoco+_train.json
│   └── ...
└── groundingdino_swinb_cogcoor.pth
  • Train ChatterBox on 8xA800 GPUs (80GB).
python startup_stage1.py  # stage1
python startup_stage2.py  # stage2
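Because training takes two stages on a multi-GPU machine, a small pre-flight check can catch a misplaced dataset before any GPU time is spent. The sketch below checks a subset of the paths from the layout above and then runs the two startup scripts in order, stopping if stage 1 fails. The helper names (`missing_paths`, `run_stages`) are hypothetical, and we assume the layout is rooted at the directory passed as `root` and that the scripts are launched from the repository folder; the scripts themselves may expect different locations.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, List, Sequence

# Subset of the layout listed above, relative to an assumed root directory.
REQUIRED_PATHS = [
    "datasets/CB-300K/CB-MRG",
    "datasets/CB-300K/CB-LC",
    "datasets/VG/VG_100K",
    "datasets/VG/VG_100K_2",
    "datasets/MSCOCO2017/train2017",
    "datasets/MSCOCO2014/train2014",
    "datasets/Flickr30K/flickr30k-images",
    "datasets/llava_instruct_150k.json",
    "datasets/CB_materials",
    "clip-vit-large-patch14/config.json",
    "llava-llama-2-13b-chat-lightning-preview/config.json",
    "OpenSource/finetune_refcoco_train.json",
    "groundingdino_swinb_cogcoor.pth",
]

# Script names as given in the training commands.
STAGE_SCRIPTS = ["startup_stage1.py", "startup_stage2.py"]


def missing_paths(root: Path, required: Sequence[str] = REQUIRED_PATHS) -> List[str]:
    """Return the required paths that do not exist under `root`."""
    return [p for p in required if not (root / p).exists()]


def _run_script(script: str) -> int:
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, script]).returncode


def run_stages(
    scripts: Sequence[str] = STAGE_SCRIPTS,
    run: Callable[[str], int] = _run_script,
) -> List[str]:
    """Run the scripts in order; stop at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    """
    completed: List[str] = []
    for script in scripts:
        if run(script) != 0:
            break
        completed.append(script)
    return completed


# Usage (from the repository folder, with data under the current directory):
#     missing = missing_paths(Path("."))
#     if missing:
#         print("Missing:", *missing, sep="\n  ")
#     else:
#         print("Completed:", run_stages())
```

Running stage 2 only after stage 1 exits cleanly mirrors the order of the two commands above; the `run` parameter is there so the launcher can be swapped out, for example in a dry run.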

Evaluation

See details at evaluation.

Demo

Coming soon

Citation

If this project has been helpful or if you've used our dataset, please cite:

@article{tian2024chatterbox,
  title={ChatterBox: Multi-round Multimodal Referring and Grounding},
  author={Tian, Yunjie and Ma, Tianren and Xie, Lingxi and Qiu, Jihao and Tang, Xi and Zhang, Yuan and Jiao, Jianbin and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2401.13307},
  year={2024}
}

Acknowledgment

This project is based on LLaVA, LISA, and GPT4RoI. We thank the authors for their excellent work.