Open LLM for Multimodal

Open Model

Model	Company/Org	Release Date	GitHub/Hugging Face	Paper/Blog	Function	Modality	License
ImageBind	FAIR, Meta AI	2023.05	facebookresearch/ImageBind	ImageBind: One Embedding Space To Bind Them All; ImageBind: Holistic AI learning across six modalities	cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation	image/video, text, audio, depth, IMU, and thermal images	CC BY-NC-SA 4.0
BLIP-2	Salesforce	2023.01	blip2 hf/blip-2	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	image-to-text, feature extraction, image-text matching	image, text	MIT
MiniGPT-4	King Abdullah University of Science and Technology	2023.05	Vision-CAIR/MiniGPT-4 hf/Vision-CAIR/MiniGPT-4	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc.	image, text	BSD 3-Clause License
LLaVA	University of Wisconsin-Madison, Microsoft Research, Columbia University	2023.05	haotian-liu/LLaVA LLaVA-13b-delta-v0	Visual Instruction Tuning	general-purpose visual and language understanding	image, text	Apache-2.0

Open Data

Name	Release Date	Function	Paper/Blog	Dataset	Samples	License
LLaVA-Instruct-150K	2023.04	IFT	Visual Instruction Tuning	liuhaotian/LLaVA-Instruct-150K	150K	CC BY-SA 4.0
LAION-400M	2021.08	PreTrain	LAION-400-MILLION OPEN DATASET	laion/laion400m	400M	CC BY-SA 4.0
CC3M	2021	PreTrain	google-research-datasets/conceptual-captions	Google's Conceptual Captions	3M	Free
CC12M	2021	PreTrain	google-research-datasets/conceptual-12m	cc12m	12M	Free
SBU	2011	PreTrain	Im2Text: Describing Images Using 1 Million Captioned Photographs	sbu_captions	1M	Unknown