multi-modal
There are 270 repositories under the multi-modal topic.
modelscope/modelscope
ModelScope: bring the notion of Model-as-a-Service to life.
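A minimal sketch of ModelScope's pipeline API, assuming `pip install modelscope`; the task name and model id below are illustrative and may not match what is currently hosted on the hub.

```python
# Hedged quick-start for ModelScope (assumes `pip install modelscope`).
# The model id is an assumption; substitute any model from the ModelScope hub.
from modelscope.pipelines import pipeline

captioner = pipeline('image-captioning',
                     model='damo/ofa_image-caption_coco_large_en')
result = captioner('path/to/image.jpg')  # returns a dict with the generated caption
print(result)
```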
lucidrains/DALLE-pytorch
Implementation/replication of DALL-E, OpenAI's text-to-image transformer, in PyTorch
THUDM/CogVLM
A state-of-the-art open visual language model | multimodal pretrained model
valhalla/valhalla
Open Source Routing Engine for OpenStreetMap
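Valhalla is typically queried over HTTP; below is a hedged sketch of a routing request against a locally running server, assuming the default port 8002 and illustrative coordinates.

```python
# Sketch of calling a local Valhalla server's /route endpoint
# (assumes a Valhalla instance listening on localhost:8002).
import requests

payload = {
    'locations': [
        {'lat': 52.5200, 'lon': 13.4050},  # start (illustrative: Berlin)
        {'lat': 52.5170, 'lon': 13.3889},  # destination
    ],
    'costing': 'auto',  # drive-time costing model
}
resp = requests.post('http://localhost:8002/route', json=payload)
print(resp.json()['trip']['summary'])  # distance and time summary
```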
marqo-ai/marqo
Unified embedding generation and search engine. Also available in the cloud at cloud.marqo.ai
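A hedged sketch of indexing and searching with the Marqo Python client, assuming `pip install marqo` and a Marqo server running locally; the index name and document fields are illustrative.

```python
# Minimal Marqo usage sketch (assumes a local Marqo server on the default port).
import marqo

mq = marqo.Client(url='http://localhost:8882')
mq.create_index('my-multimodal-index')
mq.index('my-multimodal-index').add_documents(
    [{'_id': 'doc1', 'Title': 'A photo of a horse on a beach'}],
    tensor_fields=['Title'],  # fields to embed for vector search
)
results = mq.index('my-multimodal-index').search('equine animal near the sea')
print(results['hits'][0])
```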
THUDM/VisualGLM-6B
A Chinese-English bilingual multimodal conversational language model
OFA-Sys/Chinese-CLIP
A Chinese version of CLIP that performs Chinese cross-modal retrieval and representation generation.
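A hedged zero-shot classification sketch following the Chinese-CLIP README, assuming `pip install cn_clip`; treat the function names and checkpoint name as assumptions.

```python
# Zero-shot image-text matching sketch with Chinese-CLIP (assumes `pip install cn_clip`).
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = load_from_name('ViT-B-16', device=device)
image = preprocess(Image.open('example.jpg')).unsqueeze(0).to(device)
text = clip.tokenize(['一只猫', '一只狗']).to(device)  # "a cat", "a dog"
with torch.no_grad():
    logits_per_image, _ = model.get_similarity(image, text)
    print(logits_per_image.softmax(dim=-1))  # image-text match probabilities
```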
zjunlp/DeepKE
[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction
OpenGVLab/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V | a commercially usable open-source multimodal dialogue model approaching GPT-4V performance
docarray/docarray
Represent, send, store and search multimodal data
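A hedged sketch of representing a multimodal document with DocArray's v2-style typed documents, assuming `pip install docarray`; the field names and URL are illustrative.

```python
# Representing paired text+image data with DocArray (assumes `pip install docarray`).
from docarray import BaseDoc
from docarray.typing import ImageUrl

class CaptionedImage(BaseDoc):
    caption: str
    image: ImageUrl

doc = CaptionedImage(
    caption='a horse on a beach',
    image='https://example.com/horse.png',  # hypothetical URL
)
tensor = doc.image.load()  # fetches and decodes the image into an array
```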
PKU-YuanGroup/Video-LLaVA
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
modelscope/agentscope
Start building LLM-empowered multi-agent applications in an easier way.
SciSharp/LLamaSharp
A C#/.NET library to run LLMs (🦙 LLaMA/LLaVA) efficiently on your local device.
Kav-K/GPTDiscord
A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI moderation, custom indexes/knowledge base, YouTube summarizer, and more!
PKU-YuanGroup/MoE-LLaVA
Mixture-of-Experts for Large Vision-Language Models
modelscope/data-juicer
A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ 🍸 🍹 🍷
dvlab-research/LISA
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
QIN2DIM/hcaptcha-challenger
🥂 Gracefully solve hCaptcha challenges with an embedded MoE (ONNX) solution.
DirtyHarryLYL/Transformer-in-Vision
A collection of recent Transformer-based computer vision and related works.
OpenMotionLab/MotionGPT
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
THUDM/CogVLM2
GPT-4V-level open-source multi-modal model based on Llama3-8B
tangxyw/RecSysPapers
A collection of classic and cutting-edge industry papers in the fields of recommendation, advertising, and search.
MedMNIST/MedMNIST
[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification
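A hedged sketch of loading one of the MedMNIST datasets via the pip package named in the description; PathMNIST is one of the standardized 2D classification sets, and the usage below follows the project's documented interface but should be treated as an assumption.

```python
# Loading a MedMNIST dataset (assumes `pip install medmnist`).
from medmnist import PathMNIST

train = PathMNIST(split='train', download=True)  # downloads on first use
img, label = train[0]            # PIL image and its class label array
print(len(train), img.size, label)
```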
IntelLabs/fastRAG
Efficient Retrieval Augmentation and Generation Framework
vercel/modelfusion
The TypeScript library for building AI applications.
bytedance/SALMONN
SALMONN: Speech Audio Language Music Open Neural Network
jokieleung/awesome-visual-question-answering
A curated list of Visual Question Answering (VQA, including image/video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
microsoft/farmvibes-ai
FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability
salesforce/UniControl
Unified Controllable Visual Generation Model
PKU-YuanGroup/LanguageBind
[ICLR 2024] Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
open-compass/VLMEvalKit
Open-source evaluation toolkit for large vision-language models (LVLMs); supports GPT-4V, Gemini, QwenVLPlus, 50+ Hugging Face models, and 20+ benchmarks
v-iashin/SpecVQGAN
Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
boschresearch/OASIS
Official implementation of the paper "You Only Need Adversarial Supervision for Semantic Image Synthesis" (ICLR 2021)
Tebmer/Awesome-Knowledge-Distillation-of-LLMs
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.
wangsuzhen/Audio2Head
Code for the IJCAI 2021 paper "Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion"
kyegomez/RT-2
Democratization of RT-2: an open-source implementation of "RT-2: New model translates vision and language into action"