360VL

360VL is built on the Llama 3 language model and is the industry's first open-source large multimodal model based on Llama3-70B [🤗Meta-Llama-3-70B-Instruct]. Beyond adopting Llama 3 as its language model, 360VL introduces a globally aware multi-branch projector architecture that gives the model stronger image understanding capabilities.

Install

  1. Clone this repository and navigate to the 360VL folder
git clone https://github.com/360CVGroup/360VL.git
cd 360VL
  2. Install the package
conda create -n qh360_vl python=3.10 -y
conda activate qh360_vl
bash deploy.sh
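
After installation, a quick sanity check can confirm that the environment is usable. The sketch below is not part of the repository. It assumes that `deploy.sh` installs `torch`, `transformers`, and `PIL` (the Quick Start imports all three) and only reports whether each package is importable; the helper name `check_environment` is made up for illustration.

```python
import importlib.util
import sys


def check_environment(required=("torch", "transformers", "PIL")):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in required}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it inside the activated `qh360_vl` conda environment; any package reported as missing suggests the install step did not complete.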

Model Zoo

| Model | Checkpoints | MMB-T | MMB-D | MMB-CN-T | MMB-CN-D | MMMU-V | MMMU-T | MME |
|---|---|---|---|---|---|---|---|---|
| QWen-VL-Chat | 🤗LINK | 61.8 | 60.6 | 56.3 | 56.7 | 37 | 32.9 | 1860 |
| mPLUG-Owl2 | 🤖LINK | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
| CogVLM | 🤗LINK | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
| Monkey-Chat | 🤗LINK | 72.4 | 71 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
| MM1-7B-Chat | LINK | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
| IDEFICS2-8B | 🤗LINK | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
| SVIT-v1.5-13B | 🤗LINK | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
| LLaVA-v1.5-13B | 🤗LINK | 69.2 | 69.2 | 65 | 63.6 | 36.4 | 33.6 | 1826.7 |
| LLaVA-v1.6-13B | 🤗LINK | 70 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
| Honeybee | LINK | 73.6 | 74.3 | - | - | 36.2 | - | 1976.5 |
| YI-VL-34B | 🤗LINK | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
| 360VL-8B | 🤗LINK | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1944.6 |
| 360VL-70B | 🤗LINK | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 2012.3 |

Quick Start 🤗

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/360VL-70B"

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token


image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"
# Stop generation at the Llama 3 end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
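
The last lines of the Quick Start slice `output_ids` at `input_token_len`, which implies that the returned sequences contain the prompt tokens followed by the newly generated ones. A minimal sketch of that step, with plain Python lists standing in for tensors (the token ids and the helper name `strip_prompt` are made up for illustration):

```python
def strip_prompt(output_ids, input_token_len):
    """Drop the first `input_token_len` ids from every sequence in the batch."""
    return [sequence[input_token_len:] for sequence in output_ids]


# Hypothetical ids: a 4-token prompt followed by 3 generated tokens.
prompt = [101, 7, 8, 9]
generated = [[101, 7, 8, 9, 42, 43, 44]]

reply_ids = strip_prompt(generated, input_token_len=len(prompt))
print(reply_ids)  # [[42, 43, 44]]
```

Only the remaining ids are passed to `batch_decode`, so the printed answer does not repeat the question.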

Demo

To run our demo, download the weights of 360VL (🤗LINK) and the weights of CLIP-ViT-336 (🤗LINK).

Gradio Web UI

To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server ONCE.

Launch a controller

python -m qh360_vl.serve.controller --host 0.0.0.0 --port 10000
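
Because the commands are run one by one, it can help to confirm that the controller is accepting connections before starting the next process. The following is a generic TCP readiness check, not something the repository provides; the helper name `wait_for_port` and the timeout values are assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Port 10000 matches the --port value passed to the controller.
    ready = wait_for_port("localhost", 10000, timeout=2.0)
    print("controller reachable" if ready else "controller not reachable")
```

This only checks that something is listening on the port; it does not verify that the listener is the 360VL controller.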

Launch a Gradio web server

python -m qh360_vl.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

You just launched the Gradio web interface. You can now open it with the URL printed on the screen. You may notice that the model list is empty. Do not worry: no model worker has been launched yet, and the list updates automatically once you launch one.

Launch a model worker

This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.

Note that the 8B model can run inference on a single GPU, while the 70B model requires 8 GPUs.

CUDA_VISIBLE_DEVICES=0 python -m qh360_vl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path qihoo360/360VL-8B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m qh360_vl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path qihoo360/360VL-70B
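
The two commands differ only in the visible GPUs and the model path. As a sketch, the command line can be assembled programmatically from those two values. The flags and defaults below are copied from the commands above; the helper names `worker_command` and `as_shell` are hypothetical.

```python
import shlex


def worker_command(model_path, gpu_ids, port=40000, controller="http://localhost:10000"):
    """Return (env, argv) for launching one model worker on the given GPUs."""
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "python", "-m", "qh360_vl.serve.model_worker",
        "--host", "0.0.0.0",
        "--controller", controller,
        "--port", str(port),
        "--worker", f"http://localhost:{port}",
        "--model-path", model_path,
    ]
    return env, argv


def as_shell(env, argv):
    """Render the command as a single shell line for copy-pasting."""
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


# The 8B model on one GPU, the 70B model on eight GPUs.
print(as_shell(*worker_command("qihoo360/360VL-8B", [0])))
print(as_shell(*worker_command("qihoo360/360VL-70B", range(8))))
```

If several workers run on the same host at once, each needs its own `port` value, since two processes cannot listen on the same port.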

CLI Inference

Chat about images using 360VL without the Gradio interface.

INIT_MODEL_PATH="/hbox2dir"
name="360VL-8B"
python -m qh360_vl.eval.infer \
    --model-path "$INIT_MODEL_PATH/$name"

Download Llama3 checkpoints (Non-essential)

360VL is built on Llama 3. If you need the base Llama 3 weights, please download them yourself.

[🤗Meta-Llama-3-8B-Instruct] [🤗Meta-Llama-3-70B-Instruct]

Evaluation

We follow the evaluation data organization used by LLaVA-1.5, which is described in Evaluation.md.

bash scripts/eval/mme.sh
bash scripts/eval/mmb_cn.sh
bash scripts/eval/mmb_en.sh
bash scripts/eval/refcoco.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ./scripts/eval/gqa.sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ./scripts/eval/vqav2.sh
bash scripts/eval/llavabench.sh
bash scripts/eval/mmmu.sh
bash scripts/eval/pope.sh
bash scripts/eval/textvqa.sh
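
To run the whole suite in one go, the scripts can be driven from a small Python loop. This is a sketch rather than a tool shipped with the repository: the script paths and GPU settings are copied from the list above, while `run_all` and its `dry_run` flag are hypothetical names.

```python
import os
import subprocess

ALL_GPUS = "0,1,2,3,4,5,6,7"

# (script path, CUDA_VISIBLE_DEVICES value or None), in the order listed above.
EVAL_SCRIPTS = [
    ("scripts/eval/mme.sh", None),
    ("scripts/eval/mmb_cn.sh", None),
    ("scripts/eval/mmb_en.sh", None),
    ("scripts/eval/refcoco.sh", None),
    ("./scripts/eval/gqa.sh", ALL_GPUS),
    ("./scripts/eval/vqav2.sh", ALL_GPUS),
    ("scripts/eval/llavabench.sh", None),
    ("scripts/eval/mmmu.sh", None),
    ("scripts/eval/pope.sh", None),
    ("scripts/eval/textvqa.sh", None),
]


def run_all(scripts=EVAL_SCRIPTS, dry_run=True):
    """Run each script with bash and return a list of (script, return code)."""
    results = []
    for script, gpus in scripts:
        env = dict(os.environ)
        if gpus is not None:
            env["CUDA_VISIBLE_DEVICES"] = gpus
        if dry_run:
            # Only show what would be executed; the return code stays None.
            prefix = f"CUDA_VISIBLE_DEVICES={gpus} " if gpus is not None else ""
            print(f"[dry run] {prefix}bash {script}")
            results.append((script, None))
            continue
        completed = subprocess.run(["bash", script], env=env)
        results.append((script, completed.returncode))
    return results


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call `run_all(dry_run=False)` from the repository root to execute the scripts; a non-zero return code in the result marks a benchmark that failed.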

License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those licenses. The content of this project itself is licensed under the Apache License 2.0.

Related Projects

This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!