🌏 Project Page • 🤗 Demo
Official Repository of LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
📆 Coming Soon
- Code for training with less GPU memory will be released soon. Please stay tuned.
📆 [2023-07-28]
- Checkpoints of LAMM on Hugging Face have been updated to the new code base, and LAMM performance numbers have been updated accordingly. Please check them out.
📆 [2023-07-06]
- Evaluation code for both 2D and 3D tasks is ready.
- 3D Benchmark meta files & 2D Instruction image files updated! Missing files and missing keys have been fixed. Please update accordingly.
- Updated scripts for the command-line demo.
📆 [2023-06-30]
📆 [2023-06-20]
- Full paper with Appendix is online.
📆 [2023-06-16]
- LAMM dataset is available for the research community!
📆 [2023-06-12]
- GPT Evaluation part available.
- Our paper will be released tomorrow. Please stay tuned!
📆 [2023-06-11]
- LAMM code is available for the research community!
- Try out the interactive demo on Hugging Face! (Time to build the app depends on the server load.)
For 2D images, we provide an online demo deployed on Hugging Face Spaces. Due to hardware limitations, the online version only supports a 7B-parameter LLM, and loading the pretrained model takes a few minutes.
We also provide a CLI demo for local testing. Point cloud data must be provided in npy format; we suggest using data from LAMM-Benchmark-3D (a sanity-check sketch follows the command below).
cd ./src
# For 2D images:       --vision_type image --encoder_pretrain clip --encoder_ckpt_path ''
# For 3D point clouds: --vision_type pcl   --encoder_pretrain epcl --encoder_ckpt_path $EPCL_CKPT_PATH
python cli_demo.py \
    --model lamm_peft \
    --vision_type pcl \
    --encoder_pretrain epcl \
    --encoder_ckpt_path $EPCL_CKPT_PATH \
    --vicuna_ckpt_path $LLM_CKPT_PATH \
    --delta_ckpt_path $LAMM_CKPT_PATH
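Before launching the 3D demo, you can sanity-check a point cloud file with a few lines of Python. This is a minimal sketch, assuming the .npy file stores an (N, C) float array with XYZ coordinates in the first three channels (as in the LAMM-Benchmark-3D samples); the path below is only a placeholder.

```python
# Minimal sanity check for a point-cloud .npy file before running cli_demo.py.
# Assumes an (N, C) float array with XYZ in the first three channels; the exact
# channel layout expected by the demo may differ from this sketch.
import numpy as np

pcl_path = "data/3D_Instruct/shapenet_pcls/example_4096.npy"  # placeholder path
points = np.load(pcl_path)

print("shape:", points.shape, "dtype:", points.dtype)
assert points.ndim == 2 and points.shape[1] >= 3, "expected an (N, C>=3) array"
print("XYZ bounds:", points[:, :3].min(axis=0), points[:, :3].max(axis=0))
```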
Large language models have become a potential pathway toward achieving artificial general intelligence. Recent works on multi-modal large language models have demonstrated their effectiveness in handling visual modalities. In this work, we extend the research of MLLMs to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contribution is three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We demonstrate the detailed methods of constructing instruction-tuning datasets and benchmarks for MLLMs, which will enable future research on MLLMs to scale up and extend to other domains, tasks, and modalities faster. 3) We provide a primary but potential MLLM training framework optimized for modalities' extension. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research.
LAMM-Dataset is a comprehensive multi-modal instruction tuning dataset that contains 186K language-image instruction-response pairs and 10K language-3D instruction-response pairs. In LAMM-Dataset, the instruction-response pairs are gathered from 8 image datasets and 4 point cloud datasets. We design four types of multi-modal instruction-response pairs:
- C1: n-round daily dialogue focuses on multi-modal daily conversations.
- C2: n-round factual knowledge dialogue aims at factual knowledge reasoning.
- C3: 1-round detailed description aims to elaborate images and 3D scenes in texts.
- C4: 1-round visual task dialogue transfers various vision tasks into instruction-response pairs, aiming at enhancing generalizability towards domain tasks in other modalities.
Download LAMM-Dataset from here.
If you would like to download the entire LAMM Dataset and LAMM Benchmark, you can do so from the OpenDataLab website using the provided LAMM link. The tables below show the correspondence between each meta file and image collection in the LAMM dataset:
Instruction Data For Training
- 2D_Instruct data

Meta file name | Size | Image file name | Size |
---|---|---|---|
daily_dialogue_49k.json | 112M | coco_images.zip | 7.8G |
detailed_description_49k.json | 65.5M | coco_images.zip | 7.8G |
factual_knowledge_dialogue_42k.json | 83.2M | bamboo_images.zip | 5.4G |
vision_task_dialogue_46k.json | 64.8M | coco_images.zip, bamboo_images.zip, locount_images.zip, textvqa_images.zip | 9.2G |
- 3D_Instruct data

Meta file name | Size | Point cloud file name | Size |
---|---|---|---|
LAMM_3dinstruct_10k.json | 19.6M | 3rscan_pcls.zip | 720M |
 | | shapenet_pcls.zip | 209M |
Dataset Structure
├── 2D_Instruct
│ ├── coco_images.zip
│ ├── bamboo_images.zip
│ ├── textvqa_images.zip
│ ├── locount_images.zip
│ └── meta_file
│ ├── daily_dialogue_49k.json
│ ├── detailed_description_49k.json
│ ├── factual_knowledge_dialogue_42k.json
│ └── vision_task_dialogue_46k.json
└── 3D_Instruct
├── 3rscan_pcls.zip
├── shapenet_pcls.zip
└── meta_file
└── LAMM_3dinstruct_10k.json
Meta file format
- For images
[
{
"id": "000000019028", # image id
"image": "coco_images/000000019028.jpg", # image path
"conversations": [
{
"from": "human", # instruction
"value": "How is the kitchen in the image furnished?"
},
{
"from": "gpt", # response
"value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove, and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also additional elements like a bowl of apples on the counter and a beige rug on the floor."
}
],
"task_type": "conversation", # task type
"src_image": "coco2017" # original dataset
},
{
...
}
]
- For point cloud
[
{
"pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy",
"conversations": [
{
"from": "human",
"value": "What scenario does this point cloud belong to according to the model\u2019s prediction?"
},
{
"from": "gpt",
"value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,couch,lounge s cenario."
}
],
"task_type": "classification3d",
"src_dataset": "ShapeNet",
"src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4"
},
{
...
}
]
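Since the meta files are plain JSON lists, they can be inspected with a few lines of Python. This is a minimal sketch, assuming the files are placed under ./data as described in the Dataset Structure above:

```python
# Load a LAMM meta file and print a quick summary of its instruction-response pairs.
import json
from collections import Counter

meta_path = "data/2D_Instruct/meta_file/daily_dialogue_49k.json"
with open(meta_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

print("total samples:", len(samples))
print("task types:", Counter(s["task_type"] for s in samples))

first = samples[0]
print("visual input:", first.get("image", first.get("pcl")))  # 2D entries use "image", 3D entries use "pcl"
print("first instruction:", first["conversations"][0]["value"])
```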
Notes:
- If you want to work with a specific subset of the LAMM dataset, you will need to download both the corresponding meta file and the image collection.
- If you prefer to download the data from the official websites yourself, you can organize it in the same way as we have and run it successfully. For example, during the 2D instruction tuning stage, if you only want to use daily_dialogue_49k.json, you can download the COCO2017 dataset and organize it accordingly (the sketch below can help verify that every referenced image is present).
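If you go the subset route, a quick check that every image referenced by a meta file exists on disk can save a failed training run. This is a minimal sketch, assuming the Dataset Structure layout above with the image archives already unzipped under data/2D_Instruct:

```python
# Check that all images referenced by a meta file exist on disk.
import json
from pathlib import Path

data_root = Path("data/2D_Instruct")                      # assumed root with unzipped image folders
meta_path = data_root / "meta_file" / "daily_dialogue_49k.json"

with open(meta_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

missing = [s["image"] for s in samples if not (data_root / s["image"]).exists()]
print(f"{len(missing)} of {len(samples)} referenced images are missing")
for rel in missing[:10]:
    print("missing:", rel)
```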
Prerequisite packages: gcc <= 7.5.0; nvcc >= 11.1
conda create -n lamm python=3.10 -y
conda activate lamm
# Choose the torch build that matches your CUDA version
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
Install required packages
pip install -r requirements.txt
# Optional; For 3D experiments ONLY
cd src/model/EPCL/third_party/pointnet2/
python setup.py install
cd ../../utils/
pip install cython
python cython_compile.py build_ext --inplace
Download required NLTK data
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
Optional:

- For training:

  - flash attention (v2)

    Install flash attention (v2) if you are tight on GPU memory. Please refer to flash attention's installation instructions. FlashAttention-2 currently supports Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100); a quick capability check is sketched below.

  - xformers

    Install xformers if you are tight on GPU memory and cannot use flash attention (e.g., on an NVIDIA V100). Please refer to xformers's installation instructions.

- For inference:

  - lightllm

    Install lightllm to speed up inference and decrease GPU memory usage, enabling a larger batch size.

        git clone -b multimodal https://github.com/ModelTC/lightllm.git
        cd lightllm
        python setup.py install
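If you are unsure whether to install flash attention or xformers, a quick look at the GPU's compute capability can help: FlashAttention-2 needs SM 8.0 or higher (Ampere/Ada/Hopper), while older GPUs such as the V100 should fall back to xformers. A minimal sketch, not part of the LAMM codebase:

```python
# Report whether FlashAttention-2 (requires compute capability >= 8.0) is usable
# on the current GPU; otherwise suggest xformers. Illustrative helper only.
import torch

if not torch.cuda.is_available():
    print("No CUDA GPU detected.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if major >= 8:
        print(f"{name} (SM {major}.{minor}): FlashAttention-2 is supported.")
    else:
        print(f"{name} (SM {major}.{minor}): use xformers instead of FlashAttention-2.")
```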
- Data

  Follow Download to download and prepare the data for 2D and 3D tasks. Put the downloaded data in the ./data folder (a quick path check is sketched after this list):

      ├── data
          ├── 2D_Instruct
          ├── 3D_Instruct

- Language Model: Vicuna

  To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.

- 3D Encoder: EPCL

  Download the pre-trained EPCL model, used to tokenize point clouds, from Here. Put the downloaded model in the ./ckpt folder.
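Before launching training, it can help to confirm that the folders above are in place. This is a minimal sketch checking the paths mentioned in this section; adjust the names to whatever you actually downloaded:

```python
# Verify that the data and checkpoint folders described above exist before training.
from pathlib import Path

expected = [
    "data/2D_Instruct",
    "data/3D_Instruct",
    "model_zoo/vicuna_ckpt",   # pre-trained Vicuna
    "ckpt",                    # pre-trained EPCL for 3D runs
]
for rel in expected:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{status:7s} {rel}")
```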
- 2D Models Training

  cd src
  sh scripts/train_lamm2d.sh
  # or, for slurm:
  sh scripts/train_lamm2d_slurm.sh

- 3D Models Training

  cd src
  sh scripts/train_lamm3d.sh
  # or, for slurm:
  sh scripts/train_lamm3d_slurm.sh
You will need to edit the scripts to set the data path and other hyper-parameters.
For reference, GPU memory consumption for different models is shown below:
Model Size | Sample Num/GPU | GPU Memory |
---|---|---|
Vicuna_v0_7B | 1 | ~30GB |
Vicuna_v0_7B | 2 | ~46GB |
Vicuna_v0_13B | 1 | ~53GB |
Vicuna_v0_13B | 2 | ~70GB |
LAMM-Benchmark evaluates 9 common image tasks using a total of 11 datasets with over 62,439 samples, and 3 common point cloud tasks using 3 datasets with over 12,788 samples. In contrast, existing works only provide quantitative results from fine-tuning and evaluating on specific datasets such as ScienceQA, and most works only conduct demonstrations or user studies.
- We make the first attempt to establish a benchmark for MLLMs. We conduct a comprehensive benchmark to quantify the zero-shot and fine-tuning performance of existing multi-modal language models on various computer vision tasks and compare them against the state-of-the-art methods for these tasks, including classification, object detection, pose estimation, visual question answering, facial classification, optical character recognition, and object counting.
- We also introduce two novel evaluation strategies designed explicitly for MLLMs. Specifically, for text generation we establish a scoring logic based on the GPT API; for tasks involving interactions between points and images, such as object detection and pose estimation, we propose an object-locating evaluation method (an illustrative sketch follows this list).
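For intuition, the object-locating evaluation reduces to checking whether a point predicted by the model falls inside the ground-truth bounding box. The sketch below illustrates that idea only; it is not the official LAMM metric implementation, and parsing points from real model outputs is more involved.

```python
# Illustrative point-in-box check behind a binary locating metric.
# Not the official LAMM implementation; boxes are (x1, y1, x2, y2) in pixels.
def point_in_box(point, box):
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def binary_locating_accuracy(pred_points, gt_boxes):
    """pred_points: list of (x, y); gt_boxes: list of (x1, y1, x2, y2)."""
    hits = sum(point_in_box(p, b) for p, b in zip(pred_points, gt_boxes))
    return hits / max(len(gt_boxes), 1)

# Toy usage: one hit out of two objects -> 0.5
print(binary_locating_accuracy([(50, 60), (10, 10)],
                               [(40, 40, 100, 100), (200, 200, 300, 300)]))
```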
Benchmark Data For Evaluation
- 2D_Benchmark data

Meta file name | Size | Image file name | Size |
---|---|---|---|
Caption_flickr30k.json | 598K | flickr30k_images.zip | 559M |
Classification_CIFAR10.json | 2.6M | cifar10_images.zip | 8.9M |
Counting_FSC147.json | 7.3M | fsc147_images.zip | 44M |
Detection_VOC2012.json | 6.4M | voc2012_images.zip | 196M |
Facial_Classification_CelebA(Hair).json | 2.4M | celeba_images.zip | 566M |
Facial_Classification_CelebA(Smile).json | 3.7M | celeba_images.zip | 566M |
Fine-grained_Classification_UCMerced.json | 676K | ucmerced_images.zip | 317M |
Keypoints_Dectection_LSP.json | 3.9M | fsc147_images.zip | 44M |
Locating_FSC147.json | 7.5M | fsc147_images.zip | 44M |
Locating_LSP.json | 3.9M | lsp_images.zip | 9.9M |
Locating_VOC2012.json | 6.0M | voc2012_images.zip | 196M |
OCR_SVT.json | 68K | svt_images.zip | 82M |
VQA_AI2D.json | 2.1M | ai2d_images.zip | 559M |
VQA_SQAimage.json | 3.6M | sqaimage_images.zip | 127M |
- 3D_Benchmark data

Meta file name | Size | Point cloud file name | Size |
---|---|---|---|
Detection_ScanNet.json | 1.7M | scannet_pcls.zip | 246M |
VG_ScanRefer.json | 3.7M | scannet_pcls.zip | 246M |
VQA_ScanQA_multiplechoice.json | 859K | scannet_pcls.zip | 246M |
Dataset Structure
├── 2D_Benchmark
│ ├── ai2d_images.zip
│ ├── celeba_images.zip
│ ├── cifar10_images.zip
│ ├── flickr30k_images.zip
│ ├── fsc147_images.zip
│ ├── lsp_images.zip
│ ├── sqaimage_images.zip
│ ├── svt_images.zip
│ ├── ucmerced_images.zip
│ ├── voc2012_images.zip
│ └── meta_file
│ ├── Caption_flickr30k.json
│ ├── Classification_CIFAR10.json
│ ├── Counting_FSC147.json
│ ├── Detection_VOC2012.json
│ ├── Facial_Classification_CelebA(Hair).json
│ ├── Facial_Classification_CelebA(Smile).json
│ ├── Fine-grained_Classification_UCMerced.json
│ ├── Keypoints_Dectection_LSP.json
│ ├── Locating_FSC147.json
│ ├── Locating_LSP.json
│ ├── Locating_VOC2012.json
│ ├── OCR_SVT.json
│ ├── VQA_AI2D.json
│ └── VQA_SQAimage.json
└── 3D_Benchmark
├── scannet_pcls.zip
└── meta_file
├── Detection_ScanNet.json
├── VG_ScanRefer.json
└── VQA_ScanQA_multiplechoice.json
Model Preparation

- Language Model: Vicuna

  To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.

- 3D Encoder: EPCL

  Download the pre-trained EPCL model, used to tokenize point clouds, from Here. Put the downloaded model in the ./model_zoo/epcl_ckpt folder.

- LAMM Models

  Download the LAMM model from Here and put it in the ./ckpt folder. Alternatively, you can train your own LAMM model by following the instructions Here!

- Other Models
- Inference with trained models on 2D tasks

  cd src
  sh scripts/inference_2D.sh
  # or, for slurm:
  sh scripts/inference_2D_slurm.sh

- Inference & Evaluation on 2D tasks

  sh scripts/LAMM_2D_Evaluation.sh
  # or, for slurm:
  sh scripts/LAMM_2D_Evaluation_slurm.sh

- Inference with trained models on 3D tasks

  cd src
  sh scripts/inference_3D.sh
  # or, for slurm:
  sh scripts/inference_3D_slurm.sh

- Inference & Evaluation on 3D tasks

  sh scripts/LAMM_3D_Evaluation.sh
  # or, for slurm:
  sh scripts/LAMM_3D_Evaluation_slurm.sh

- Evaluation for other MLLM models

  Please refer to LLaVA, MiniGPT-4 and mPLUG-owl for inference, respectively. Save the answers in ./answers, then run common_eval_2d.py for evaluation. For example, to evaluate LLaVA on VOC2012:

  python common_eval_2d.py \
      --dataset-name VOC2012 \
      --answer-file ./answers/LLaVA \
      --base-data-path ./data/LAMM-Dataset/2D_Benchmark \
      2>&1 | tee ./results/LLaVA/eval_VOC2012.log
- GPT Metric

  Make sure that you have finished inference on all the evaluation datasets for both your model (or the LAMM model) and the MLLM model you want to compare against. For example, to rank LAMM and LLaVA:

  sh scripts/GPT_metric.sh

  You may need to edit the script to set the datasets to evaluate and the checkpoint folders to load.
Results of LAMM model on selected 2D vision tasks
Task | Dataset | LAMM (Zero-Shot) | LAMM (Finetune) |
---|---|---|---|
Classification (Acc) | CIFAR10 | 37.90 | 91.2 |
Object Detection (mAP@0.5) | VOC2012 | 7.20 | 13.48 |
VQA (Acc) | SQAimage | 49.88 | 74.27 |
Results of 3D tasks by LAMM
Task | Dataset | SOTA | LAMM (Zero-Shot) | LAMM (Finetune) |
---|---|---|---|---|
3D Object Detection (mAP@0.5) | ScanNet | 63.2 | 8.2 | 11.89 |
Visual Grounding (mAP@0.5) | ScanRefer | 54.59 | Failed | 3.38 |
3D VQA (Acc of multiple-choice problems) | ScanQA | N/A | 24.90 | 99.89 |
Comparison of Binary Locating Metric and GPT Metric results for existing MLLMs
Metric | LLaVA | MiniGPT4 | mPLUG-owl | LAMM |
---|---|---|---|---|
Binary-Loc Metric | 14.73 | 13.12 | 4.42 | 36.53 |
GPT Metric | 11 | - | - | 89 |
Comparison of Multimodal Large Language Models on 2D computer vision tasks.
Bold fonts for the best results.
Task | Dataset | Metric | SOTA | LLaVA | MiniGPT4 | mPLUG-owl | LAMM |
---|---|---|---|---|---|---|---|
Classification | CIFAR10 | Acc ↑ | 99.5 | 60.83 | 46.22 | 42.5 | 37.9 |
Detection | VOC2012 | mAP ↑ | 97.2 | 1.42 | 0.92 | 0.158 | 7.20 |
VQA | SQAimage | Acc ↑ | 92.53 | 40.5 | 43.43 | 36.39 | 49.88 |
VQA | AI2D | Acc ↑ | N/A | 18.13 | Failed | 19.31 | 20.92 |
Image Caption | flickr30k | BLEU4 ↑ | 30.1 | 6.65 | 5.1 | 2.74 | 2.56 |
Fine-grained Classification | UCMerced | Acc ↑ | 100 | 47 | 33.6 | 32.5 | 18.23 |
Counting | FSC147 | MAE ↓ | 10.79 | 56.2 | Failed | 60.67 | 46.88 |
OCR | SVT | Word Acc ↑ | 97.9 | 37.78 | 16.97 | 30.39 | 29.14 |
Facial Classification | CelebA(Smile) | Acc ↑ | N/A | Failed | 66.36 | Failed | 57.60 |
Facial Classification | CelebA(Hair) | Acc ↑ | N/A | 46.42 | 43.47 | 40.93 | 56.96 |
Keypoints Detection | LSP | PCK ↑ | 99.5 | Failed | Failed | Failed | Failed |
# Training Samples | Vision Encoder | LLM | Training Data | LoRA Rank | Link |
---|---|---|---|---|---|
98K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
186K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D Instruction Data | 32 | Checkpoints |
98K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
186K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D Instruction Data | 32 | Checkpoints |
10K | EPCL-ViT-L | Vicuna13B | LAMM-3D Instruction Data | 32 | Checkpoints |
@article{yin2023lamm,
title={LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark},
author={Yin, Zhenfei and Wang, Jiong and Cao, Jianjian and Shi, Zhelun and Liu, Dingning and Li, Mukai and Sheng, Lu and Bai, Lei and Huang, Xiaoshui and Wang, Zhiyong and others},
journal={arXiv preprint arXiv:2306.06687},
year={2023}
}
The project is released under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes. The checkpoints are likewise released under CC BY-NC 4.0.
We thank Hongxing Fan, Zeren Chen, and Zhen Wang for their support of the LAMM project.
We also thank the authors of the great works we build on, including CLIP, EPCL, LLaMA, Vicuna, FlashAttention, xformers, and lightllm.