Latest Advances on Embodied Multimodal LLMs (or Vision-Language-Action Models).

Awesome Embodied Multimodal LLMs
(Vision-Language-Action Models)

This is a collection of research papers on Embodied Multimodal Large Language Models, also known as Vision-Language-Action (VLA) models.

If you would like to include your paper or update any details (e.g., code URLs, conference information), please feel free to submit a pull request or email me. Any advice is also welcome.

Table of Contents

- Overview
- Models
- Datasets & Benchmarks

Overview

Embodied Multimodal LLMs integrate visual inputs and action outputs into large language models (LLMs). Leveraging the rich knowledge and strong reasoning capabilities of LLMs, these models can interactively follow human instructions, build a comprehensive understanding of the real world, and perform a wide range of embodied tasks. They are widely regarded as a promising step toward Artificial General Intelligence (AGI).
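
To make this input/output pattern concrete, below is a minimal, hypothetical PyTorch sketch of a VLA-style model: a vision encoder turns an image into visual tokens, the tokenized instruction is embedded as text tokens, a small transformer (standing in for the pretrained LLM backbone) fuses both streams, and an action head predicts discretized action tokens. All module names, sizes, and the 7-DoF/256-bin action space are illustrative assumptions, not the architecture of any paper listed below.

```python
# Minimal, hypothetical VLA-style sketch (illustrative names and sizes;
# not the architecture of any specific paper in this list).
import torch
import torch.nn as nn


class TinyVLA(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_action_dims=7, n_action_bins=256):
        super().__init__()
        # Vision encoder: patchify the image into a sequence of visual tokens.
        self.vision_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Text embedding for the tokenized instruction.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Small transformer standing in for the pretrained LLM backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Action head: one set of discretized bins per action dimension (e.g. a 7-DoF arm).
        self.action_head = nn.Linear(d_model, n_action_dims * n_action_bins)
        self.n_action_dims, self.n_action_bins = n_action_dims, n_action_bins

    def forward(self, image, instruction_ids):
        # (B, 3, H, W) -> (B, N_patches, d_model)
        vis = self.vision_encoder(image).flatten(2).transpose(1, 2)
        txt = self.text_embed(instruction_ids)            # (B, T, d_model)
        h = self.backbone(torch.cat([vis, txt], dim=1))   # fuse vision and language tokens
        logits = self.action_head(h[:, -1])               # read the action off the last token
        return logits.view(-1, self.n_action_dims, self.n_action_bins)


# Usage: one 224x224 RGB frame plus a tokenized instruction -> per-dimension action logits.
model = TinyVLA()
action_logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(action_logits.shape)  # torch.Size([1, 7, 256])
```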

Models

| Title | Date | Code |
|-------|------|------|
| OpenVLA: An Open-Source Vision-Language-Action Model | 2024-06-13 | Github |
| A3VLM: Actionable Articulation-Aware Vision Language Model | 2024-06-11 | Github |
| Embodied CoT Distillation From LLM To Off-the-shelf Agents | 2024-05-02 | - |
| RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models | 2024-04-07 | - |
| 3D-VLA: A 3D Vision-Language-Action Generative World Model | 2024-03-14 | Github |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | 2024-02-27 | Github |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | 2024-02-24 | - |
| MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World | 2024-01-16 | Github |
| ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | 2023-12-24 | Github |
| MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception | 2023-12-12 | Github |
| Towards Learning a Generalist Model for Embodied Navigation | 2023-12-04 | Github |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | 2023-11-30 | Github |
| An Embodied Generalist Agent in 3D World | 2023-11-18 | Github |
| Large Language Models as Generalizable Policies for Embodied Tasks | 2023-10-26 | Github |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023-07-28 | - |
| Building Cooperative Embodied Agents Modularly with Large Language Models | 2023-07-05 | Github |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | 2023-05-24 | Github |
| PaLM-E: An Embodied Multimodal Language Model | 2023-03-06 | - |

Datasets & Benchmarks

| Title | Date | Code |
|-------|------|------|
| OpenEQA: Embodied Question Answering in the Era of Foundation Models | 2024-06-17 | Github |
| PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI | 2024-04-15 | Github |
| EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | 2023-12-26 | Github |
| Holodeck: Language Guided Generation of 3D Embodied AI Environments | 2023-12-14 | Github |
| Learning Interactive Real-World Simulators | 2023-10-09 | - |
| Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | 2023-10-03 | Github |