InternLM-XComposer

InternLM-XComposer 🤗 🤖 | InternLM-XComposer-VL 🤗 🤖 | Technical Report 📄

English | 简体中文

👋 Join us on Discord and WeChat


InternLM-XComposer is a vision-language large model (VLLM) based on InternLM for advanced text-image comprehension and composition. InternLM-XComposer has several appealing properties:

  • Interleaved Text-Image Composition: InternLM-XComposer can generate coherent, contextual articles that integrate images, providing a more engaging and immersive reading experience. The interleaved text-image composition is implemented in the following steps:

    1. Text Generation: It crafts long-form text based on human-provided instructions.
    2. Image Spotting and Captioning: It pinpoints optimal locations for image placement and furnishes image descriptions.
    3. Image Retrieval and Selection: It selects image candidates and identifies the image that best complements the content.
  • Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content.

  • Strong Performance: It consistently achieves state-of-the-art results across various benchmarks for vision-language large models, including MME Benchmark (English), MMBench (English), SEED-Bench (English), MMBench-CN (Chinese), and CCBench (Chinese).
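
The three composition steps listed above can be pictured as a small data flow. The following is a toy sketch only, not the model's implementation: every function name here is hypothetical, a fixed spacing rule stands in for image spotting, the paragraph text stands in for the generated caption, and plain word overlap stands in for learned retrieval and selection.

```python
from typing import Dict, List, Optional


def spot_image_positions(paragraphs: List[str], every: int = 2) -> List[int]:
    """Hypothetical stand-in for step 2: put an image after every `every`-th paragraph."""
    return [i for i in range(len(paragraphs)) if (i + 1) % every == 0]


def word_overlap(a: str, b: str) -> int:
    """Count distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def select_image(caption: str, pool: Dict[str, str]) -> Optional[str]:
    """Hypothetical stand-in for step 3: pick the candidate whose description overlaps most."""
    if not pool:
        return None
    return max(pool, key=lambda name: word_overlap(caption, pool[name]))


def compose(paragraphs: List[str], pool: Dict[str, str]) -> List[str]:
    """Interleave paragraphs (step 1 output) with the selected image placeholders."""
    positions = set(spot_image_positions(paragraphs))
    article = []
    for i, paragraph in enumerate(paragraphs):
        article.append(paragraph)
        if i in positions:
            # The paragraph itself plays the role of the image caption here.
            article.append(f"[image: {select_image(paragraph, pool)}]")
    return article


# Made-up example data: image file names mapped to short descriptions.
paragraphs = [
    "Einstein was born in Ulm.",
    "He developed the theory of relativity.",
    "He later moved to Princeton.",
    "He received the Nobel Prize in physics.",
]
pool = {
    "relativity.jpg": "diagram of the theory of relativity",
    "nobel.jpg": "the nobel prize medal in physics",
}
article = compose(paragraphs, pool)
print("\n".join(article))
```

In the real system each stand-in is performed by the model; the sketch only shows how text, image positions, and selected images fit together.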

We release the InternLM-XComposer series in two versions:

  • InternLM-XComposer-VL-7B 🤗 🤖 : The pretrained and multi-task trained VLLM with InternLM as the initialization of the LLM, achieving strong performance on various multimodal benchmarks, e.g., MME Benchmark, MMBench, SEED-Bench, CCBench, and MMBench-CN.
  • InternLM-XComposer-7B 🤗 🤖 : The further instruction-tuned VLLM for Interleaved Text-Image Composition and as an LLM-based AI assistant.

Please refer to the Technical Report for more details.

Demo

demo.mp4

Please refer to Chinese Demo for the demo of the Chinese version.

News and Updates


Evaluation

We evaluate InternLM-XComposer-VL on five multimodal benchmarks: MME Benchmark, MMBench, and SEED-Bench in English, and CCBench and MMBench-CN in Simplified Chinese.

  • MME Benchmark: A comprehensive evaluation benchmark for multimodal large language models with 14 subtasks.
  • MMBench: A comprehensive evaluation pipeline that combines a meticulously curated multimodal dataset with a novel CircularEval strategy using ChatGPT.
  • MMBench-CN: A Simplified Chinese version of MMBench.
  • SEED-Bench: A multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs.
  • CCBench: A multimodal benchmark for Chinese cultural comprehension.

InternLM-XComposer-VL outperforms existing vision-language large models on all five benchmarks, demonstrating stronger multilingual comprehension ability.

MME Benchmark

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

InternLM-XComposer-VL achieves state-of-the-art overall performance. See more details HERE.
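
To put the overall score in context, here is a minimal sketch of how an MME-style score is assembled. It rests on assumptions taken from the MME benchmark's published description rather than from this repository: each image comes with two yes/no questions, a subtask score is accuracy plus "accuracy+" (both questions for an image answered correctly), and subtasks are split into 10 perception and 4 cognition tasks. The function name is hypothetical.

```python
from typing import List, Tuple

# The 14 subtasks named above, grouped as in the MME description (assumption).
PERCEPTION = ["existence", "count", "position", "color", "poster",
              "celebrity", "scene", "landmark", "artwork", "OCR"]
COGNITION = ["commonsense reasoning", "numerical calculation",
             "text translation", "code reasoning"]

MAX_PER_SUBTASK = 200  # assumption: accuracy (0-100) + accuracy+ (0-100)
MAX_TOTAL = MAX_PER_SUBTASK * (len(PERCEPTION) + len(COGNITION))


def subtask_score(pairs: List[Tuple[bool, bool]]) -> float:
    """Score one subtask from per-image results (first question ok, second question ok)."""
    accuracy = 100 * sum(a + b for a, b in pairs) / (2 * len(pairs))
    accuracy_plus = 100 * sum(1 for a, b in pairs if a and b) / len(pairs)
    return accuracy + accuracy_plus


# Two images: both questions right on the first, one of two right on the second.
example = subtask_score([(True, True), (True, False)])
print(example, "out of", MAX_PER_SUBTASK)
print("maximum total under these assumptions:", MAX_TOTAL)
```

Under these assumptions the overall score is a sum over subtasks, not a percentage, which is why it is on a different scale from the other benchmarks.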

Overall Performance

| Rank | Model | Version | Score |
|------|-------|---------|-------|
| 1 | InternLM-XComposer-VL | InternLM-7B | 1919.5 |
| 2 | Qwen-VL-Chat | Qwen-7B | 1848.3 |
| 3 | MMICL | FlanT5xxl | 1810.7 |
| 4 | Skywork-MM | Skywork-MM-13B | 1775.5 |
| 5 | BLIVA | FlanT5xxl | 1669.2 |

MMBench & MMBench-CN

MMBench is a comprehensive evaluation pipeline that combines a meticulously curated multimodal dataset with a novel CircularEval strategy using ChatGPT. It covers 20 ability dimensions defined by MMBench. MMBench-CN is the Chinese-language version of MMBench.

InternLM-XComposer-VL achieves state-of-the-art results on the test split of both MMBench and MMBench-CN. See more details HERE.
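
A minimal sketch of the CircularEval idea, assuming the scheme described by the MMBench authors: a multiple-choice question is asked once per circular shift of its choices and counts as correct only if the model picks the right choice every time. The `predict` callable is a hypothetical stand-in for a model; the real pipeline also uses ChatGPT to map free-form answers onto a choice, which is omitted here.

```python
from typing import Callable, List, Sequence


def circular_eval(
    question: str,
    choices: Sequence[str],
    answer_index: int,
    predict: Callable[[str, List[str]], int],
) -> bool:
    """Return True only if `predict` is right under every circular shift of `choices`."""
    n = len(choices)
    for shift in range(n):
        shifted = [choices[(i + shift) % n] for i in range(n)]
        # Position of the correct choice after shifting.
        correct_position = (answer_index - shift) % n
        if predict(question, shifted) != correct_position:
            return False
    return True


def knows_answer(question: str, choices: List[str]) -> int:
    """Toy predictor that always finds the right content."""
    return choices.index("Paris")


def always_first(question: str, choices: List[str]) -> int:
    """Toy predictor that always picks the first option."""
    return 0


options = ["Paris", "Rome", "Berlin", "Madrid"]
print(circular_eval("Capital of France?", options, 0, knows_answer))
print(circular_eval("Capital of France?", options, 0, always_first))
```

The second predictor would score 25% under ordinary single-pass evaluation on questions like this one, but fails here, which is the position bias this strategy is meant to filter out.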

MMBench Test Split

| Rank | Model | Version | Score |
|------|-------|---------|-------|
| 1 | InternLM-XComposer-VL | InternLM-7B | 74.4 |
| 2 | Pink | Vicuna-7B | 74.1 |
| 3 | JiuTian | FLANT5-XXL | 71.8 |
| 4 | WeMM | InternLM-7B | 69.0 |
| 5 | mPLUG-Owl | LLaMA2 7B | 68.5 |

MMBench-CN Test Split

| Rank | Model | Version | Score |
|------|-------|---------|-------|
| 1 | InternLM-XComposer-VL | InternLM-7B | 72.4 |
| 2 | Qwen-VL-Chat | Qwen-7B | 56.3 |
| 3 | LLaVA | LLaMA 7B | 36.6 |
| 4 | VisualGLM | ChatGLM 6B | 25.6 |
| 5 | mPLUG-Owl | LLaMA2 7B | 24.9 |

SEED-Bench

SEED-Bench is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs, covering 12 evaluation dimensions including both image and video understanding. See more details HERE.

InternLM-XComposer-VL achieves state-of-the-art results on the image portion of this benchmark.

SEED-Bench Image Evaluation

| Rank | Model | Version | Score |
|------|-------|---------|-------|
| 1 | InternLM-XComposer-VL | InternLM-7B | 66.9 |
| 2 | Qwen-VL-Chat | Qwen-7B | 65.4 |
| 3 | Qwen-VL | Qwen-7B | 62.3 |
| 4 | InstructBLIP-Vicuna | Vicuna 7B | 58.8 |
| 5 | InstructBLIP | Flan-T5-XL | 57.8 |

CCBench

CCBench is a multimodal benchmark for Chinese cultural comprehension. See more details HERE.

CCBench Performance

| Rank | Model | Version | Score |
|------|-------|---------|-------|
| 1 | InternLM-XComposer-VL | InternLM-7B | 47.6 |
| 2 | Qwen-VL-Chat | Qwen-7B | 39.3 |
| 3 | mPLUG-Owl | LLaMA2 7B | 12.9 |
| 4 | InstructBLIP | Vicuna 7B | 12.1 |
| 5 | VisualGLM | ChatGLM 6B | 9.2 |
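
As a quick way to read the five leaderboards together, the snippet below computes the gap between the top score and the runner-up on each benchmark. The numbers are copied from the tables above; nothing else is assumed. Note that the MME score is a sum over subtasks while the others are on a 0-100 scale, so the margins are not directly comparable across benchmarks.

```python
# (top score, runner-up score) per benchmark, copied from the tables above.
results = {
    "MME": (1919.5, 1848.3),
    "MMBench": (74.4, 74.1),
    "MMBench-CN": (72.4, 56.3),
    "SEED-Bench (image)": (66.9, 65.4),
    "CCBench": (47.6, 39.3),
}

# Round to one decimal place to match the precision of the reported scores.
margins = {name: round(top - second, 1) for name, (top, second) in results.items()}

for name, margin in margins.items():
    print(f"{name}: +{margin} over the runner-up")
```

The margins are small on the English benchmarks MMBench and SEED-Bench and much larger on the Chinese benchmarks MMBench-CN and CCBench.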

Requirements

  • python 3.8 and above
  • pytorch 1.12 and above (2.0 and above is recommended)
  • CUDA 11.4 and above is recommended (for GPU users)
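
The following is a small, optional sketch for checking an environment against the requirements above. It is not part of the repository; the helper names are made up for illustration. It only relies on `sys.version_info`, `torch.__version__`, `torch.cuda.is_available()`, and `torch.version.cuda`, and it reports rather than fails when PyTorch is missing.

```python
import importlib.util
import re
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '2.0.1+cu118' into (2, 0, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def meets_minimum(installed: str, minimum: Tuple[int, ...]) -> bool:
    """Compare the numeric prefix of `installed` with `minimum`."""
    return parse_version(installed) >= minimum


def report() -> None:
    """Print a short summary of how the current environment compares."""
    print("python >= 3.8:", sys.version_info >= (3, 8))
    if importlib.util.find_spec("torch") is None:
        print("pytorch: not installed")
        return
    import torch

    print("pytorch >= 1.12:", meets_minimum(str(torch.__version__), (1, 12)))
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA version PyTorch was built with:", torch.version.cuda)


if __name__ == "__main__":
    report()
```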

Installation

Before running the code, make sure your environment meets the requirements above, then install the dependent libraries. Please refer to the installation instructions.

Quickstart

We provide a simple example to show how to use InternLM-XComposer with 🤗 Transformers.

🤗 Transformers

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True)
model.tokenizer = tokenizer

# example image
image = 'examples/images/aiyinsitan.jpg'

# Single-Turn Pure-Text Dialogue
text = 'Please introduce Einstein.'
response = model.generate(text)
print(response)
# 'Albert Einstein was a German-born theoretical physicist. He developed the general theory of relativity, 
# one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence 
# on the philosophy of science. In 1921, Einstein was awarded the Nobel Prize in Physics "for his services to 
# theoretical physics, and especially for his discovery of the law of the photoelectric effect.'


# Single-Turn Text-Image Dialogue
text = 'Please introduce the person in this picture in detail.'
image = 'examples/images/aiyinsitan.jpg'
response = model.generate(text, image)
print(response)
# 'The person in the picture is Albert Einstein, a renowned theoretical physicist and one of the most influential 
# scientists of the 20th century. He was born on March 14, 1879, in Ulm, Germany, and died on April 18, 1955, 
# in Princeton, New Jersey.'


# Multi-Turn Text-Image Dialogue
# 1st turn
text = 'Who is in the picture?'
response, history = model.chat(text=text, image=image, history=None)
print(response)
# 'Albert Einstein is in the picture.'

# 2nd turn
text = 'What are his achievements?'
response, history = model.chat(text=text, image=None, history=history)
print(response)
# 'Albert Einstein was a German-born theoretical physicist who developed the general theory of relativity, 
# one of the two pillars of modern physics (alongside quantum mechanics). He is best known for his mass–energy
# equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), and his explanation of 
# the photoelectric effect, both of which are examples of his special and general theories of relativity.'

# 3rd turn
text = 'Is he the greatest physicist?'
response, history = model.chat(text=text, image=None, history=history)
print(response)
# 'Yes, Albert Einstein is widely regarded as one of the greatest physicists of all time'
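
The multi-turn pattern above (image on the first turn, `history` threaded through later turns) can be wrapped in a small helper. This is a sketch rather than part of the library: `run_conversation` is a hypothetical name, it uses only the `model.chat(text=..., image=..., history=...)` call shown above, and it treats `history` as an opaque value. `EchoModel` is a stand-in so the sketch runs without downloading weights.

```python
from typing import List, Optional, Tuple


def run_conversation(model, image: Optional[str], turns: List[str]) -> List[Tuple[str, str]]:
    """Send `turns` in order through `model.chat`, attaching `image` to the first turn only."""
    history = None
    transcript = []
    for index, text in enumerate(turns):
        response, history = model.chat(
            text=text,
            image=image if index == 0 else None,
            history=history,  # passed back unchanged, as in the example above
        )
        transcript.append((text, response))
    return transcript


class EchoModel:
    """Stand-in with the same `chat` call shape; not the real model."""

    def chat(self, text, image=None, history=None):
        history = (history or []) + [(text, image)]
        return f"turn {len(history)}: {text}", history


transcript = run_conversation(
    EchoModel(),
    "examples/images/aiyinsitan.jpg",
    ["Who is in the picture?", "What are his achievements?"],
)
for question, answer in transcript:
    print(question, "->", answer)
```

With the real model, pass the `model` object created above in place of `EchoModel()`.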

🤖 ModelScope

import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-xcomposer-7b')
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model.tokenizer = tokenizer

# example image
image = 'examples/images/aiyinsitan.jpg'

# Single-Turn Pure-Text Dialogue
text = 'Please introduce Einstein.'
response = model.generate(text)
print(response)
# 'Albert Einstein was a German-born theoretical physicist. He developed the general theory of relativity, 
# one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence 
# on the philosophy of science. In 1921, Einstein was awarded the Nobel Prize in Physics "for his services to 
# theoretical physics, and especially for his discovery of the law of the photoelectric effect.'

Web UI

We provide code for users to build a web UI demo.

Please run the command below:

python examples/web_demo.py

The user guide for the UI demo is available HERE.

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@misc{zhang2023internlmxcomposer,
      title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition}, 
      author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Wenwei Zhang and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2023},
      eprint={2309.15112},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License & Contact Us

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English)/申请表（中文）. For other questions or collaborations, please contact internlm@pjlab.org.cn.