LAMM

🌏 Project Page🤗 Demo▶️ YouTube 📺 Bilibili 📀 Data📊 Benchmark📦 LAMM Models

Updates

📆 Coming Soon

  1. Code with reduced GPU memory usage will be released soon. Please stay tuned.

📆 [2023-07-28]

  1. Checkpoints of LAMM on Hugging Face have been updated to the new code base, and LAMM performance numbers have been updated accordingly. Please check them out.

📆 [2023-07-06]

  1. Evaluation code for both 2D and 3D tasks is ready.
  2. 3D benchmark meta files and 2D instruction image files have been updated! Missing files and missing keys are fixed. Please update accordingly.
  3. Updated scripts for the command-line demo.

📆 [2023-06-30]

  1. Watch the demo video for LAMM on YouTube and Bilibili!

📆 [2023-06-20]

  1. The full paper with appendix is online.

📆 [2023-06-16]

  1. The LAMM dataset is available to the research community!

📆 [2023-06-12]

  1. The GPT evaluation part is available.

  2. Our paper will be released tomorrow. Please stay tuned!

📆 [2023-06-11]

  1. LAMM code is available to the research community!

  2. Try out the interactive demo on Hugging Face! (The time to build the app depends on server load.)

Demos

Online Demo

For 2D images, we provide an online demo deployed on Hugging Face Spaces.

Due to hardware capacity limitations, the online version only supports a 7B-parameter LLM, and loading the pretrained model takes a few minutes.

CLI Demo

We also provide a CLI demo for local testing. Point cloud data must be in npy format; we suggest using data from LAMM-Benchmark-3D.

    cd ./src
    # Set --vision_type to 'pcl' or 'image' and --encoder_pretrain to 'epcl' or 'clip' accordingly.
    # For image input, pass '' for --encoder_ckpt_path.
    python cli_demo.py \
        --model lamm_peft \
        --vision_type pcl \
        --encoder_pretrain epcl \
        --encoder_ckpt_path $EPCL_CKPT_PATH \
        --vicuna_ckpt_path $LLM_CKPT_PATH \
        --delta_ckpt_path $LAMM_CKPT_PATH
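
If your point clouds are not already saved as .npy files, the minimal sketch below shows one way to prepare them, assuming a plain (N, 3) float32 XYZ array subsampled to 4096 points (as suggested by the LAMM-Benchmark-3D file names); verify the exact array layout against a LAMM-Benchmark-3D .npy file before use.

    import numpy as np

    def save_pcl_as_npy(points, out_path, num_points=4096):
        """Subsample an (N, 3) XYZ point array and save it as .npy.

        Assumption: the demo consumes a plain float32 XYZ array; check the
        expected shape/channels against a LAMM-Benchmark-3D .npy file.
        """
        points = np.asarray(points, dtype=np.float32)
        if points.shape[0] > num_points:
            idx = np.random.choice(points.shape[0], num_points, replace=False)
            points = points[idx]
        np.save(out_path, points)

    # Example: points = np.loadtxt("scene.xyz"); save_pcl_as_npy(points, "scene_4096.npy")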

Overview

Large language models have become a potential pathway toward achieving artificial general intelligence. Recent works on multi-modal large language models have demonstrated their effectiveness in handling visual modalities. In this work, we extend the research of MLLMs to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contribution is three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We demonstrate the detailed methods of constructing instruction-tuning datasets and benchmarks for MLLMs, which will enable future research on MLLMs to scale up and extend to other domains, tasks, and modalities faster. 3) We provide a primary but potential MLLM training framework optimized for modalities' extension. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research.

LAMM-Dataset

LAMM-Dataset is a comprehensive multi-modal instruction tuning dataset containing 186K language-image instruction-response pairs and 10K language-3D instruction-response pairs. The instruction-response pairs in LAMM-Dataset are gathered from 8 image datasets and 4 point cloud datasets. We design four types of multi-modal instruction-response pairs:

  • C1: n-round daily dialogue focuses on multi-modal daily conversations.
  • C2: n-round factual knowledge dialogue aims at factual knowledge reasoning.
  • C3: 1-round detailed description aims to elaborate images and 3D scenes in texts.
  • C4: 1-round visual task dialogue transfers various vision tasks into instruction-response pairs, aiming at enhancing generalizability towards domain tasks in other modalities.

Download

Download LAMM-Dataset from here.

If you would like to download the entire LAMM dataset and LAMM benchmark, you can do so from the OpenDataLab website using the provided LAMM link. The dataset structures below show the correspondence between each meta file and its image (or point cloud) collection in the LAMM dataset:

Instruction Data For Training

Dataset Structure
└── 2D_Instruct  
│   ├── coco_images.zip  
│   ├── bamboo_images.zip  
│   ├── textvqa_images.zip  
│   ├── locount_images.zip  
│   └── meta_file  
│       ├── daily_dialogue_49k.json  
│       ├── detailed_description_49k.json  
│       ├── factual_knowledge_dialogue_42k.json  
│       └── vision_task_dialogue_46k.json  
└── 3D_Instruct  
    ├── 3rscan_pcls.zip  
    ├── shapenet_pcls.zip  
    └── meta_file  
        └── LAMM_3dinstruct_10k.json  

Meta file format
  • For images
[
    {
    "id": "000000019028",  # image id
    "image": "coco_images/000000019028.jpg", # image path
    "conversations": [
        {
            "from": "human",  # instruction
            "value": "How is the kitchen in the image furnished?"
        },
        {
            "from": "gpt",  # response
            "value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove, and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also additional elements like a bowl of apples on the counter and a beige rug on the floor."
        }
    ],
    "task_type": "conversation",  # task type
    "src_image": "coco2017" # original dataset
    },
    {
        ...
    }
]
  • For point cloud
[
    {
        "pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy",
        "conversations": [
                {
                    "from": "human",
                    "value": "What scenario does this point cloud belong to according to the model\u2019s prediction?"
                },
                {
                    "from": "gpt",
                    "value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,couch,lounge s       cenario."
                }
            ],
        "task_type": "classification3d",
        "src_dataset": "ShapeNet",
        "src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4"
    },
    {
        ...
    }
]
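
Both kinds of meta files are plain JSON lists, so a few lines of Python are enough to inspect them. The path below assumes the data layout shown above under ./data; adjust it to wherever you placed the files.

    import json

    # Load a LAMM meta file (a JSON list of instruction-response samples).
    with open("data/2D_Instruct/meta_file/daily_dialogue_49k.json", "r") as f:
        samples = json.load(f)

    sample = samples[0]
    print(sample["id"], sample["image"], sample["task_type"])

    # Each conversation alternates between "human" (instruction) and "gpt" (response).
    for turn in sample["conversations"]:
        print(f'{turn["from"]}: {turn["value"]}')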

Notes

  1. If you want to work with a specific subset of the LAMM dataset, you need to download both the corresponding meta file and the image collection.
  2. If you prefer to download the data from the official websites yourself, you can still organize it the same way we have and run it successfully. For example, during the 2D instruction-tuning stage, if you only want to run the daily_dialogue_49k.json file, you can download the COCO2017 dataset and organize it accordingly.

LAMM-Framework

Installation

Prerequisite packages: gcc <= 7.5.0; nvcc >= 11.1

    conda create -n lamm python=3.10 -y
    conda activate lamm
    # Choose a different version of torch according to your CUDA version
    conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

Install required packages

    pip install -r requirements.txt

    # Optional; For 3D experiments ONLY
    cd src/model/EPCL/third_party/pointnet2/
    python setup.py install
    cd ../../utils/
    pip install cython
    python cython_compile.py build_ext --inplace

Download required NLTK data

    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')

Optional:

  • for training

    • FlashAttention (v2)
      Install FlashAttention-2 if you are tight on GPU memory. Please refer to FlashAttention's installation instructions.

      FlashAttention-2 currently supports Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).

    • xformers
      Install xformers if you are tight on GPU memory and cannot use FlashAttention (e.g., on an NVIDIA V100). Please refer to xformers' installation instructions.

  • for inference

    • lightllm
      Install lightllm to speed up inference and decrease GPU memory usage, enabling larger batch sizes.
      git clone -b multimodal  https://github.com/ModelTC/lightllm.git
      cd lightllm
      python setup.py install
      

Data & Model Preparation for Training

  • Data

    Follow Download to download and prepare the data for 2D and 3D tasks. Put the downloaded data in the ./data folder.

    ├── data
        ├── 2D_Instruct  
        ├── 3D_Instruct
    
  • Language Model: Vicuna

    To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.

  • 3D Encoder: EPCL

    Download the pre-trained EPCL model (used to tokenize point clouds) from Here. Put the downloaded models in the ./ckpt folder.

Training

  • 2D Models Training
    cd src
    sh scripts/train_lamm2d.sh
    or
    sh scripts/train_lamm2d_slurm.sh       # for slurm
  • 3D Models Training
    cd src
    sh scripts/train_lamm3d.sh
    or
    sh scripts/train_lamm3d_slurm.sh       # for slurm

You may need to edit the scripts to change the data path and other hyper-parameters.

For your reference, GPU memory consumption for different models is shown below:

Model Size       Sample Num/GPU    GPU Memory
Vicuna_v0_7B     1                 ~30GB
Vicuna_v0_7B     2                 ~46GB
Vicuna_v0_13B    1                 ~53GB
Vicuna_v0_13B    2                 ~70GB

LAMM-Benchmark

LAMM-Benchmark evaluates 9 common image tasks using 11 datasets with over 62,439 samples, and 3 common point cloud tasks using 3 datasets with over 12,788 samples. In contrast, existing works only provide quantitative results from fine-tuning and evaluating on specific datasets such as ScienceQA, and most only conduct demonstrations or user studies.

  • We are the first attempt to establish a benchmark for MLLMs. We conducted a comprehensive benchmark to quantify the zero-shot and fine-tuning performance of existing multi-modal language models on various computer vision tasks and compared them against state-of-the-art methods for these tasks, including classification, object detection, pose estimation, visual question answering, facial classification, optical character recognition, and object counting.
  • We also introduce two novel evaluation strategies designed explicitly for MLLMs. For text generation, we established a scoring logic based on the GPT API. For tasks involving interactions between points and images, such as object detection and pose estimation, we proposed an object-locating evaluation method; a sketch of the locating check is shown after this list.
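
For illustration, here is a minimal sketch of the idea behind the binary locating check, assuming each model answer has been parsed into an (x, y) point and each ground truth is an [x1, y1, x2, y2] box; the metric implementation shipped in the repository may differ in its details.

    def binary_locating_hit(pred_point, gt_box):
        """Return True if the predicted (x, y) point falls inside the ground-truth box.

        Illustrative sketch only, not the repository's exact implementation.
        """
        x, y = pred_point
        x1, y1, x2, y2 = gt_box
        return x1 <= x <= x2 and y1 <= y <= y2

    def binary_locating_accuracy(pred_points, gt_boxes):
        """Fraction of predicted points that land inside their ground-truth boxes."""
        hits = sum(binary_locating_hit(p, b) for p, b in zip(pred_points, gt_boxes))
        return hits / max(len(pred_points), 1)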

Data & Model Preparation for LAMM-Benchmark

Benchmark Data For Evaluation

Dataset Structure
├── 2D_Benchmark  
│   ├── ai2d_images.zip  
│   ├── celeba_images.zip  
│   ├── cifar10_images.zip  
│   ├── flickr30k_images.zip  
│   ├── fsc147_images.zip  
│   ├── lsp_images.zip  
│   ├── sqaimage_images.zip  
│   ├── svt_images.zip  
│   ├── ucmerced_images.zip  
│   ├── voc2012_images.zip  
│   └── meta_file  
│       ├── Caption_flickr30k.json  
│       ├── Classification_CIFAR10.json  
│       ├── Counting_FSC147.json  
│       ├── Detection_VOC2012.json  
│       ├── Facial_Classification_CelebA(Hair).json  
│       ├── Facial_Classification_CelebA(Smile).json  
│       ├── Fine-grained_Classification_UCMerced.json  
│       ├── Keypoints_Dectection_LSP.json  
│       ├── Locating_FSC147.json  
│       ├── Locating_LSP.json  
│       ├── Locating_VOC2012.json  
│       ├── OCR_SVT.json  
│       ├── VQA_AI2D.json  
│       └── VQA_SQAimage.json  
└── 3D_Benchmark  
    ├── scannet_pcls.zip  
    └── meta_file  
        ├── Detection_ScanNet.json  
        ├── VG_ScanRefer.json  
        └── VQA_ScanQA_multiplechoice.json
Model Preparation
  • Language Model: Vicuna

    To prepare the pre-trained Vicuna model, please follow the instructions provided Here. Put the downloaded model in the ./model_zoo/vicuna_ckpt folder.

  • 3D Encoder: EPCL

    Download the pre-trained EPCL model (used to tokenize point clouds) from Here. Put the downloaded models in the ./model_zoo/epcl_ckpt folder.

  • LAMM Models

    Download LAMM model from Here. Put the downloaded models in the ./ckpt folder.

    Or you can train your own LAMM model by following the instructions Here!

  • Other Models

Evaluation

  • Inference trained models on 2D tasks
    cd src
    sh scripts/inference_2D.sh

    or

    sh scripts/inference_2D_slurm.sh       # for slurm
  • Inference & Evaluation on 2D tasks
    sh scripts/LAMM_2D_Evaluation.sh

    or

    sh scripts/LAMM_2D_Evaluation_slurm.sh  # for slurm
  • Inference trained models on 3D tasks
    cd src
    sh scripts/inference_3D.sh

    or

    sh scripts/inference_3D_slurm.sh       # for slurm
  • Inference & evaluation trained models on 3D tasks
    sh scripts/LAMM_3D_Evaluation.sh

    or

    sh scripts/LAMM_3D_Evaluation_slurm.sh  # for slurm
  • Evaluation for other MLLM models.

    Please refer to LLaVA, MiniGPT-4, and mPLUG-owl for their respective inference procedures. Save the answers in ./answers, then run common_eval_2d.py for evaluation (a wrapper sketch for sweeping several datasets follows this list). For example, to evaluate LLaVA on VOC2012:

    python common_eval_2d.py \
        --dataset-name VOC2012 \
        --answer-file ./answers/LLaVA \
        --base-data-path ./data/LAMM-Dataset/2D_Benchmark \
        2>&1 | tee ./results/LLaVA/eval_VOC2012.log
  • GPT Metric

    Make sure that you have finished inference on all the evaluation datasets, both for your model (or the LAMM model) and for the MLLM models to compare against. For example, to rank LAMM and LLaVA:

    sh scripts/GPT_metric.sh
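
To sweep several benchmark datasets for one model, a small wrapper like the one below can call common_eval_2d.py repeatedly; the dataset names listed are illustrative, so check the meta files under 2D_Benchmark/meta_file for the names accepted by --dataset-name.

    import subprocess

    # Illustrative dataset names; confirm against the 2D_Benchmark meta files.
    datasets = ["VOC2012", "CIFAR10", "SVT"]
    model = "LLaVA"

    for name in datasets:
        cmd = [
            "python", "common_eval_2d.py",
            "--dataset-name", name,
            "--answer-file", f"./answers/{model}",
            "--base-data-path", "./data/LAMM-Dataset/2D_Benchmark",
        ]
        # Write a per-dataset log, mirroring the tee usage above.
        with open(f"./results/{model}/eval_{name}.log", "w") as log:
            subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)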

You may need to edit the scripts to change the datasets to evaluate and the checkpoint folders to load.

Leaderboard

Results of LAMM model on selected 2D vision tasks

Task                          Dataset     LAMM (Zero-Shot)    LAMM (Finetune)
Classification (Acc)          CIFAR10     37.90               91.2
Object Detection (mAP@0.5)    VOC2012     7.20                13.48
VQA (Acc)                     SQAimage    49.88               74.27

Results of 3D tasks by LAMM

Task                                      Dataset      SOTA     LAMM (Zero-Shot)    LAMM (Finetune)
3D Object Detection (mAP@0.5)             ScanNet      63.2     8.2                 11.89
Visual Grounding (mAP@0.5)                ScanRefer    54.59    Failed              3.38
3D VQA (Acc of multiple-choice problem)   ScanQA       N/A      24.90               99.89

Comparison of results of Binary Locating Metric and GPT Metric of existing MLLMs

Metric               LLaVA    MiniGPT4    mPLUG-owl    LAMM
Binary-Loc Metric    14.73    13.12       4.42         36.53
GPT Metric           11       -           -            89

Comparison of Multimodal Large Language Models on 2D computer vision tasks.


Task                         Dataset          Metric        SOTA     LLaVA     MiniGPT4    mPLUG-owl    LAMM
Classification               CIFAR10          Acc ↑         99.5     60.83     46.22       42.5         37.9
Detection                    VOC2012          mAP ↑         97.2     1.42      0.92        0.158        7.20
VQA                          SQAimage         Acc ↑         92.53    40.5      43.43       36.39        49.88
VQA                          AI2D             Acc ↑         N/A      18.13     Failed      19.31        20.92
Image Caption                flickr30k        BLEU4 ↑       30.1     6.65      5.1         2.74         2.56
Fine-grained Classification  UCMerced         Acc ↑         100      47        33.6        32.5         18.23
Counting                     FSC147           MAE ↓         10.79    56.2      Failed      60.67        46.88
OCR                          SVT              Word Acc ↑    97.9     37.78     16.97       30.39        29.14
Facial Classification        CelebA(Smile)    Acc ↑         N/A      Failed    66.36       Failed       57.60
Facial Classification        CelebA(Hair)     Acc ↑         N/A      46.42     43.47       40.93        56.96
Keypoints Detection          LSP              PCK ↑         99.5     Failed    Failed      Failed       Failed

LAMM Model Zoo

# Training Samples    Vision Encoder    LLM              Training Data                            LoRA Rank    Link
98K                   CLIP-ViT-L        Vicuna_v0_7B     LAMM-2D daily dialogue & description     32           Checkpoints
186K                  CLIP-ViT-L        Vicuna_v0_7B     LAMM-2D Instruction Data                 32           Checkpoints
98K                   CLIP-ViT-L        Vicuna_v0_13B    LAMM-2D daily dialogue & description     32           Checkpoints
186K                  CLIP-ViT-L        Vicuna_v0_13B    LAMM-2D Instruction Data                 32           Checkpoints
10K                   EPCL-ViT-L        Vicuna13B        LAMM-3D Instruction Data                 32           Checkpoints

Citation

    @article{yin2023lamm,
        title={LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark},
        author={Yin, Zhenfei and Wang, Jiong and Cao, Jianjian and Shi, Zhelun and Liu, Dingning and Li, Mukai and Sheng, Lu and Bai, Lei and Huang, Xiaoshui and Wang, Zhiyong and others},
        journal={arXiv preprint arXiv:2306.06687},
        year={2023}
}

License

This project is released under the CC BY-NC 4.0 license (non-commercial use only), and models trained using the dataset should not be used outside of research purposes. The checkpoints are likewise released under CC BY-NC 4.0 (non-commercial use only).


Acknowledgement

We thank Hongxing Fan, Zeren Chen, and Zhen Wang for their support of the LAMM project.

We also thank the great works that LAMM builds on, including CLIP, EPCL, LLaMA, Vicuna, FlashAttention, xformers, and lightllm.