🌏 Project Page • 🤗 Demo
Official Repository of LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
📆 Coming Soon
- Code for training with less GPU memory will be released soon. Please stay tuned.
📆 [2023-07-28]
- Checkpoints of LAMM on Hugging Face have been updated to the new code base, and LAMM performance numbers have been updated accordingly. Please check them out.
📆 [2023-07-06]
- Evaluation code for both 2D and 3D tasks is ready.
- 3D Benchmark meta files & 2D Instruction image files updated! Missing files and missing keys have been fixed. Please update accordingly.
- Updated scripts for the command-line demo.
📆 [2023-06-30]
📆 [2023-06-20]
- Full paper with Appendix is online.
📆 [2023-06-16]
- LAMM dataset is available for the research community!
📆 [2023-06-12]
- GPT Evaluation part available.
- Our paper will be released tomorrow. Please stay tuned!
📆 [2023-06-11]
- LAMM code is available for the research community!
- Try out the interactive demo on Hugging Face! (Time to build the app depends on the server load.)
For 2D images, we provide an online demo deployed on Hugging Face Spaces. Due to hardware limitations, the online version only supports a 7B-parameter LLM, and loading the pretrained model takes a few minutes.
We also provide a CLI demo for local testing. Point cloud data must be provided in npy format; we suggest using data from LAMM-Benchmark-3D (a sanity-check sketch follows the command below).
cd ./src
# For 2D images:       --vision_type image --encoder_pretrain clip --encoder_ckpt_path ''
# For 3D point clouds: --vision_type pcl   --encoder_pretrain epcl --encoder_ckpt_path $EPCL_CKPT_PATH
python cli_demo.py \
    --model lamm_peft \
    --vision_type pcl \
    --encoder_pretrain epcl \
    --encoder_ckpt_path $EPCL_CKPT_PATH \
    --vicuna_ckpt_path $LLM_CKPT_PATH \
    --delta_ckpt_path $LAMM_CKPT_PATH
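Before launching the 3D demo, you can sanity-check a point cloud file with a few lines of Python. This is a minimal sketch, assuming the .npy file stores an (N, C) float array with XYZ coordinates in the first three channels (as in the LAMM-Benchmark-3D samples); the path below is only a placeholder.

```python
# Minimal sanity check for a point-cloud .npy file before running cli_demo.py.
# Assumes an (N, C) float array with XYZ in the first three channels; the exact
# channel layout expected by the demo may differ from this sketch.
import numpy as np

pcl_path = "data/3D_Instruct/shapenet_pcls/example_4096.npy"  # placeholder path
points = np.load(pcl_path)

print("shape:", points.shape, "dtype:", points.dtype)
assert points.ndim == 2 and points.shape[1] >= 3, "expected an (N, C>=3) array"
print("XYZ bounds:", points[:, :3].min(axis=0), points[:, :3].max(axis=0))
```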
Large language models have become a potential pathway toward achieving artificial general intelligence. Recent works on multi-modal large language models have demonstrated their effectiveness in handling visual modalities. In this work, we extend the research of MLLMs to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contribution is three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We demonstrate the detailed methods of constructing instruction-tuning datasets and benchmarks for MLLMs, which will enable future research on MLLMs to scale up and extend to other domains, tasks, and modalities faster. 3) We provide a primary but potential MLLM training framework optimized for modalities' extension. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research.
LAMM-Dataset is a comprehensive multi-modal instruction tuning dataset that contains 186K language-image instruction-response pairs and 10K language-3D instruction-response pairs. In LAMM-Dataset, the instruction-response pairs are gathered from 8 image datasets and 4 point cloud datasets. We design four types of multi-modal instruction-response pairs:
- C1: n-round daily dialogue focuses on multi-modal daily conversations.
- C2: n-round factual knowledge dialogue aims at factual knowledge reasoning.
- C3: 1-round detailed description aims to elaborate images and 3D scenes in texts.
- C4: 1-round visual task dialogue transfers various vision tasks into instruction-response pairs, aiming at enhancing generalizability towards domain tasks in other modalities.
Download LAMM-Dataset from here.
If you would like to download the entire LAMM Dataset and LAMM Benchmark, you can do so from the OpenDataLab website using the provided LAMM link. The tables below show the correspondence between each meta file and image collection in the LAMM dataset:
Instruction Data For Training
- 2D_Instruct data

Meta file name | Size | Image file name | Size |
---|---|---|---|
daily_dialogue_49k.json | 112M | coco_images.zip | 7.8G |
detailed_description_49k.json | 65.5M | coco_images.zip | 7.8G |
factual_knowledge_dialogue_42k.json | 83.2M | bamboo_images.zip | 5.4G |
vision_task_dialogue_46k.json | 64.8M | coco_images.zip, bamboo_images.zip, locount_images.zip, textvqa_images.zip | 9.2G |
- 3D_Instruct data

Meta file name | Size | Point cloud file name | Size |
---|---|---|---|
LAMM_3dinstruct_10k.json | 19.6M | 3rscan_pcls.zip | 720M |
 | | shapenet_pcls.zip | 209M |
Dataset Structure
├── 2D_Instruct
│ ├── coco_images.zip
│ ├── bamboo_images.zip
│ ├── textvqa_images.zip
│ ├── locount_images.zip
│ └── meta_file
│ ├── daily_dialogue_49k.json
│ ├── detailed_description_49k.json
│ ├── factual_knowledge_dialogue_42k.json
│ └── vision_task_dialogue_46k.json
└── 3D_Instruct
├── 3rscan_pcls.zip
├── shapenet_pcls.zip
└── meta_file
└── LAMM_3dinstruct_10k.json
Meta file format
- For images
[
{
"id": "000000019028", # image id
"image": "coco_images/000000019028.jpg", # image path
"conversations": [
{
"from": "human", # instruction
"value": "How is the kitchen in the image furnished?"
},
{
"from": "gpt", # response
"value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove, and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also additional elements like a bowl of apples on the counter and a beige rug on the floor."
}
],
"task_type": "conversation", # task type
"src_image": "coco2017" # original dataset
},
{
...
}
]
- For point cloud
[
{
"pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy",
"conversations": [
{
"from": "human",
"value": "What scenario does this point cloud belong to according to the model\u2019s prediction?"
},
{
"from": "gpt",
"value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,couch,lounge s cenario."
}
],
"task_type": "classification3d",
"src_dataset": "ShapeNet",
"src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4"
},
{
...
}
]
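Since the meta files are plain JSON lists, they can be inspected with a few lines of Python. This is a minimal sketch, assuming the files are placed under ./data as described in the Dataset Structure above:

```python
# Load a LAMM meta file and print a quick summary of its instruction-response pairs.
import json
from collections import Counter

meta_path = "data/2D_Instruct/meta_file/daily_dialogue_49k.json"
with open(meta_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

print("total samples:", len(samples))
print("task types:", Counter(s["task_type"] for s in samples))

first = samples[0]
print("visual input:", first.get("image", first.get("pcl")))  # 2D entries use "image", 3D entries use "pcl"
print("first instruction:", first["conversations"][0]["value"])
```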
Notes:
- If you want to work with a specific subset of the LAMM dataset, you will need to download both the corresponding meta file and the image collection.
- If you prefer to download the data from the official websites yourself, you can organize it in the same way as we have and run it successfully. For example, during the 2D instruction tuning stage, if you only want to use daily_dialogue_49k.json, you can download the COCO2017 dataset and organize it accordingly (the sketch below can help verify that every referenced image is present).
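If you go the subset route, a quick check that every image referenced by a meta file exists on disk can save a failed training run. This is a minimal sketch, assuming the Dataset Structure layout above with the image archives already unzipped under data/2D_Instruct:

```python
# Check that all images referenced by a meta file exist on disk.
import json
from pathlib import Path

data_root = Path("data/2D_Instruct")                      # assumed root with unzipped image folders
meta_path = data_root / "meta_file" / "daily_dialogue_49k.json"

with open(meta_path, "r", encoding="utf-8") as f:
    samples = json.load(f)

missing = [s["image"] for s in samples if not (data_root / s["image"]).exists()]
print(f"{len(missing)} of {len(samples)} referenced images are missing")
for rel in missing[:10]:
    print("missing:", rel)
```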
Prerequisite packages: gcc <= 7.5.0; nvcc >= 11.1
conda create -n lamm python=3.10 -y
conda activate lamm
# Choose the torch build that matches your CUDA version
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
Install required packages
pip install -r requirements.txt
# Optional; For 3D experiments ONLY
cd src/model/EPCL/third_party/pointnet2/
python setup.py install
cd ../../utils/
pip install cython
python cython_compile.py build_ext --inplace
Download required NLTK data
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
Optional:

- For training:

  - flash attention (v2)

    Install flash attention (v2) if you are tight on GPU memory. Please refer to flash attention's installation instructions. FlashAttention-2 currently supports Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100); a quick capability check is sketched below.

  - xformers

    Install xformers if you are tight on GPU memory and cannot use flash attention (e.g., on an NVIDIA V100). Please refer to xformers's installation instructions.

- For inference:

  - lightllm

    Install lightllm to speed up inference and decrease GPU memory usage, enabling a larger batch size.

        git clone -b multimodal https://github.com/ModelTC/lightllm.git
        cd lightllm
        python setup.py install
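If you are unsure whether to install flash attention or xformers, a quick look at the GPU's compute capability can help: FlashAttention-2 needs SM 8.0 or higher (Ampere/Ada/Hopper), while older GPUs such as the V100 should fall back to xformers. A minimal sketch, not part of the LAMM codebase:

```python
# Report whether FlashAttention-2 (requires compute capability >= 8.0) is usable
# on the current GPU; otherwise suggest xformers. Illustrative helper only.
import torch

if not torch.cuda.is_available():
    print("No CUDA GPU detected.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    if major >= 8:
        print(f"{name} (SM {major}.{minor}): FlashAttention-2 is supported.")
    else:
        print(f"{name} (SM {major}.{minor}): use xformers instead of FlashAttention-2.")
```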
- Data

  Follow Download to download and prepare the data for 2D and 3D tasks. Put the downloaded data in the ./data folder (a quick path check is sketched after this list):

      ├── data
          ├── 2D_Instruct
          ├── 3D_Instruct

- Language Model: Vicuna

  To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.

- 3D Encoder: EPCL

  Download the pre-trained EPCL model, used to tokenize point clouds, from Here. Put the downloaded model in the ./ckpt folder.
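Before launching training, it can help to confirm that the folders above are in place. This is a minimal sketch checking the paths mentioned in this section; adjust the names to whatever you actually downloaded:

```python
# Verify that the data and checkpoint folders described above exist before training.
from pathlib import Path

expected = [
    "data/2D_Instruct",
    "data/3D_Instruct",
    "model_zoo/vicuna_ckpt",   # pre-trained Vicuna
    "ckpt",                    # pre-trained EPCL for 3D runs
]
for rel in expected:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{status:7s} {rel}")
```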
- 2D Models Training

  cd src
  sh scripts/train_lamm2d.sh
  # or, for slurm:
  sh scripts/train_lamm2d_slurm.sh

- 3D Models Training

  cd src
  sh scripts/train_lamm3d.sh
  # or, for slurm:
  sh scripts/train_lamm3d_slurm.sh
You will need to edit the scripts to set the data path and other hyper-parameters.
For reference, GPU memory consumption for different models is shown below:
Model Size | Sample Num/GPU | GPU Memory |
---|---|---|
Vicuna_v0_7B | 1 | ~30GB |
Vicuna_v0_7B | 2 | ~46GB |
Vicuna_v0_13B | 1 | ~53GB |
Vicuna_v0_13B | 2 | ~70GB |
LAMM-Benchmark evaluates 9 common image tasks using a total of 11 datasets with over 62,439 samples, and 3 common point cloud tasks using 3 datasets with over 12,788 samples. In contrast, existing works only provide quantitative results from fine-tuning and evaluating on specific datasets such as ScienceQA, and most works only conduct demonstrations or user studies.
- We make the first attempt to establish a benchmark for MLLMs. We conduct a comprehensive benchmark to quantify the zero-shot and fine-tuning performance of existing multi-modal language models on various computer vision tasks and compare them against the state-of-the-art methods for these tasks, including classification, object detection, pose estimation, visual question answering, facial classification, optical character recognition, and object counting.
- We also introduce two novel evaluation strategies designed explicitly for MLLMs. Specifically, for text generation we establish a scoring logic based on the GPT API; for tasks involving interactions between points and images, such as object detection and pose estimation, we propose an object-locating evaluation method (an illustrative sketch follows this list).
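For intuition, the object-locating evaluation reduces to checking whether a point predicted by the model falls inside the ground-truth bounding box. The sketch below illustrates that idea only; it is not the official LAMM metric implementation, and parsing points from real model outputs is more involved.

```python
# Illustrative point-in-box check behind a binary locating metric.
# Not the official LAMM implementation; boxes are (x1, y1, x2, y2) in pixels.
def point_in_box(point, box):
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def binary_locating_accuracy(pred_points, gt_boxes):
    """pred_points: list of (x, y); gt_boxes: list of (x1, y1, x2, y2)."""
    hits = sum(point_in_box(p, b) for p, b in zip(pred_points, gt_boxes))
    return hits / max(len(gt_boxes), 1)

# Toy usage: one hit out of two objects -> 0.5
print(binary_locating_accuracy([(50, 60), (10, 10)],
                               [(40, 40, 100, 100), (200, 200, 300, 300)]))
```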
Benchmark Data For Evaluation
- 2D_Benchmark data

Meta file name | Size | Image file name | Size |
---|---|---|---|
Caption_flickr30k.json | 598K | flickr30k_images.zip | 559M |
Classification_CIFAR10.json | 2.6M | cifar10_images.zip | 8.9M |
Counting_FSC147.json | 7.3M | fsc147_images.zip | 44M |
Detection_VOC2012.json | 6.4M | voc2012_images.zip | 196M |
Facial_Classification_CelebA(Hair).json | 2.4M | celeba_images.zip | 566M |
Facial_Classification_CelebA(Smile).json | 3.7M | celeba_images.zip | 566M |
Fine-grained_Classification_UCMerced.json | 676K | ucmerced_images.zip | 317M |
Keypoints_Dectection_LSP.json | 3.9M | fsc147_images.zip | 44M |
Locating_FSC147.json | 7.5M | fsc147_images.zip | 44M |
Locating_LSP.json | 3.9M | lsp_images.zip | 9.9M |
Locating_VOC2012.json | 6.0M | voc2012_images.zip | 196M |
OCR_SVT.json | 68K | svt_images.zip | 82M |
VQA_AI2D.json | 2.1M | ai2d_images.zip | 559M |
VQA_SQAimage.json | 3.6M | sqaimage_images.zip | 127M |
- 3D_Benchmark data

Meta file name | Size | Point cloud file name | Size |
---|---|---|---|
Detection_ScanNet.json | 1.7M | scannet_pcls.zip | 246M |
VG_ScanRefer.json | 3.7M | scannet_pcls.zip | 246M |
VQA_ScanQA_multiplechoice.json | 859K | scannet_pcls.zip | 246M |
Dataset Structure
├── 2D_Benchmark
│ ├── ai2d_images.zip
│ ├── celeba_images.zip
│ ├── cifar10_images.zip
│ ├── flickr30k_images.zip
│ ├── fsc147_images.zip
│ ├── lsp_images.zip
│ ├── sqaimage_images.zip
│ ├── svt_images.zip
│ ├── ucmerced_images.zip
│ ├── voc2012_images.zip
│ └── meta_file
│ ├── Caption_flickr30k.json
│ ├── Classification_CIFAR10.json
│ ├── Counting_FSC147.json
│ ├── Detection_VOC2012.json
│ ├── Facial_Classification_CelebA(Hair).json
│ ├── Facial_Classification_CelebA(Smile).json
│ ├── Fine-grained_Classification_UCMerced.json
│ ├── Keypoints_Dectection_LSP.json
│ ├── Locating_FSC147.json
│ ├── Locating_LSP.json
│ ├── Locating_VOC2012.json
│ ├── OCR_SVT.json
│ ├── VQA_AI2D.json
│ └── VQA_SQAimage.json
└── 3D_Benchmark
├── scannet_pcls.zip
└── meta_file
├── Detection_ScanNet.json
├── VG_ScanRefer.json
└── VQA_ScanQA_multiplechoice.json
Model Preparation

- Language Model: Vicuna

  To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.

- 3D Encoder: EPCL

  Download the pre-trained EPCL model, used to tokenize point clouds, from Here. Put the downloaded model in the ./model_zoo/epcl_ckpt folder.

- LAMM Models

  Download the LAMM model from Here and put it in the ./ckpt folder. Alternatively, you can train your own LAMM model by following the instructions Here!

- Other Models
- Inference with trained models on 2D tasks

  cd src
  sh scripts/inference_2D.sh
  # or, for slurm:
  sh scripts/inference_2D_slurm.sh

- Inference & Evaluation on 2D tasks

  sh scripts/LAMM_2D_Evaluation.sh
  # or, for slurm:
  sh scripts/LAMM_2D_Evaluation_slurm.sh

- Inference with trained models on 3D tasks

  cd src
  sh scripts/inference_3D.sh
  # or, for slurm:
  sh scripts/inference_3D_slurm.sh

- Inference & Evaluation on 3D tasks

  sh scripts/LAMM_3D_Evaluation.sh
  # or, for slurm:
  sh scripts/LAMM_3D_Evaluation_slurm.sh

- Evaluation for other MLLM models

  Please refer to LLaVA, MiniGPT-4 and mPLUG-owl for inference, respectively. Save the answers in ./answers, then run common_eval_2d.py for evaluation. For example, to evaluate LLaVA on VOC2012:

  python common_eval_2d.py \
      --dataset-name VOC2012 \
      --answer-file ./answers/LLaVA \
      --base-data-path ./data/LAMM-Dataset/2D_Benchmark \
      2>&1 | tee ./results/LLaVA/eval_VOC2012.log
- GPT Metric

  Make sure that you have finished inference on all the evaluation datasets for both your model (or the LAMM model) and the MLLM model you want to compare against. For example, to rank LAMM and LLaVA:

  sh scripts/GPT_metric.sh

  You may need to edit the script to set the datasets to evaluate and the checkpoint folders to load.
Results of LAMM model on selected 2D vision tasks
Task | Dataset | LAMM (Zero-Shot) | LAMM (Finetune) |
---|---|---|---|
Classification (Acc) | CIFAR10 | 37.90 | 91.2 |
Object Detection (mAP@0.5) | VOC2012 | 7.20 | 13.48 |
VQA (Acc) | SQAimage | 49.88 | 74.27 |
Results of 3D tasks by LAMM
Task | Dataset | SOTA | LAMM (Zero-Shot) | LAMM (Finetune) |
---|---|---|---|---|
3D Object Detection (mAP@0.5) | ScanNet | 63.2 | 8.2 | 11.89 |
Visual Grounding (mAP@0.5) | ScanRefer | 54.59 | Failed | 3.38 |
3D VQA (Acc of multiple-choice problems) | ScanQA | N/A | 24.90 | 99.89 |
Comparison of Binary Locating Metric and GPT Metric results for existing MLLMs
Metric | LLaVA | MiniGPT4 | mPLUG-owl | LAMM |
---|---|---|---|---|
Binary-Loc Metric | 14.73 | 13.12 | 4.42 | 36.53 |
GPT Metric | 11 | - | - | 89 |
Comparison of Multimodal Large Language Models on 2D computer vision tasks.
Bold fonts for the best results.
Task | Dataset | Metric | SOTA | LLaVA | MiniGPT4 | mPLUG-owl | LAMM |
---|---|---|---|---|---|---|---|
Classification | CIFAR10 | Acc ↑ | 99.5 | 60.83 | 46.22 | 42.5 | 37.9 |
Detection | VOC2012 | mAP ↑ | 97.2 | 1.42 | 0.92 | 0.158 | 7.20 |
VQA | SQAimage | Acc ↑ | 92.53 | 40.5 | 43.43 | 36.39 | 49.88 |
VQA | AI2D | Acc ↑ | N/A | 18.13 | Failed | 19.31 | 20.92 |
Image Caption | flickr30k | BLEU4 ↑ | 30.1 | 6.65 | 5.1 | 2.74 | 2.56 |
Fine-grained Classification | UCMerced | Acc ↑ | 100 | 47 | 33.6 | 32.5 | 18.23 |
Counting | FSC147 | MAE ↓ | 10.79 | 56.2 | Failed | 60.67 | 46.88 |
OCR | SVT | Word Acc ↑ | 97.9 | 37.78 | 16.97 | 30.39 | 29.14 |
Facial Classification | CelebA(Smile) | Acc ↑ | N/A | Failed | 66.36 | Failed | 57.60 |
Facial Classification | CelebA(Hair) | Acc ↑ | N/A | 46.42 | 43.47 | 40.93 | 56.96 |
Keypoints Detection | LSP | PCK ↑ | 99.5 | Failed | Failed | Failed | Failed |
# Training Samples | Vision Encoder | LLM | Training Data | LoRA Rank | Link |
---|---|---|---|---|---|
98K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
186K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D Instruction Data | 32 | Checkpoints |
98K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
186K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D Instruction Data | 32 | Checkpoints |
10K | EPCL-ViT-L | Vicuna13B | LAMM-3D Instruction Data | 32 | Checkpoints |
@article{yin2023lamm,
title={LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark},
author={Yin, Zhenfei and Wang, Jiong and Cao, Jianjian and Shi, Zhelun and Liu, Dingning and Li, Mukai and Sheng, Lu and Bai, Lei and Huang, Xiaoshui and Wang, Zhiyong and others},
journal={arXiv preprint arXiv:2306.06687},
year={2023}
}
The project is released under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes. The checkpoints are likewise released under CC BY-NC 4.0.
We thank Hongxing Fan, Zeren Chen, and Zhen Wang for their support of the LAMM project.
We also thank the authors of the great works we build on, including CLIP, EPCL, LLaMA, Vicuna, FlashAttention, xformers, and lightllm.