Latest Advances on Embodied Multimodal LLMs (or Vision-Language-Action Models).

Awesome Embodied Multimodal LLMs
(Vision-Language-Action Models)

This is a collection of research papers on Embodied Multimodal Large Language Models, also known as Vision-Language-Action (VLA) models.

If you would like to include your paper or update any details (e.g., code URLs, conference information), please feel free to submit a pull request or email me. Any advice is also welcome.

Table of Contents

- Overview
- Models
- Datasets & Benchmarks

Overview

Embodied Multimodal LLMs integrate visual inputs and action outputs into large language models (LLMs). Leveraging the rich knowledge and strong reasoning capabilities of LLMs, these models can interactively follow human instructions, build a comprehensive understanding of the real world, and perform a wide range of embodied tasks. They are widely regarded as a promising step toward Artificial General Intelligence (AGI).
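
To make this input/output pattern concrete, below is a minimal, hypothetical PyTorch sketch of a VLA-style model: a vision encoder turns an image into visual tokens, the tokenized instruction is embedded as text tokens, a small transformer (standing in for the pretrained LLM backbone) fuses both streams, and an action head predicts discretized action tokens. All module names, sizes, and the 7-DoF/256-bin action space are illustrative assumptions, not the architecture of any paper listed below.

```python
# Minimal, hypothetical VLA-style sketch (illustrative names and sizes;
# not the architecture of any specific paper in this list).
import torch
import torch.nn as nn


class TinyVLA(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_action_dims=7, n_action_bins=256):
        super().__init__()
        # Vision encoder: patchify the image into a sequence of visual tokens.
        self.vision_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Text embedding for the tokenized instruction.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Small transformer standing in for the pretrained LLM backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Action head: one set of discretized bins per action dimension (e.g. a 7-DoF arm).
        self.action_head = nn.Linear(d_model, n_action_dims * n_action_bins)
        self.n_action_dims, self.n_action_bins = n_action_dims, n_action_bins

    def forward(self, image, instruction_ids):
        # (B, 3, H, W) -> (B, N_patches, d_model)
        vis = self.vision_encoder(image).flatten(2).transpose(1, 2)
        txt = self.text_embed(instruction_ids)            # (B, T, d_model)
        h = self.backbone(torch.cat([vis, txt], dim=1))   # fuse vision and language tokens
        logits = self.action_head(h[:, -1])               # read the action off the last token
        return logits.view(-1, self.n_action_dims, self.n_action_bins)


# Usage: one 224x224 RGB frame plus a tokenized instruction -> per-dimension action logits.
model = TinyVLA()
action_logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(action_logits.shape)  # torch.Size([1, 7, 256])
```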

Models

| Title | Date | Code |
|-------|------|------|
| OpenVLA: An Open-Source Vision-Language-Action Model | 2024-06-13 | Github |
| A3VLM: Actionable Articulation-Aware Vision Language Model | 2024-06-11 | Github |
| Embodied CoT Distillation From LLM To Off-the-shelf Agents | 2024-05-02 | - |
| RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models | 2024-04-07 | - |
| 3D-VLA: A 3D Vision-Language-Action Generative World Model | 2024-03-14 | Github |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | 2024-02-27 | Github |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | 2024-02-24 | - |
| MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World | 2024-01-16 | Github |
| ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | 2023-12-24 | Github |
| MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception | 2023-12-12 | Github |
| Towards Learning a Generalist Model for Embodied Navigation | 2023-12-04 | Github |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | 2023-11-30 | Github |
| An Embodied Generalist Agent in 3D World | 2023-11-18 | Github |
| Large Language Models as Generalizable Policies for Embodied Tasks | 2023-10-26 | Github |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023-07-28 | - |
| Building Cooperative Embodied Agents Modularly with Large Language Models | 2023-07-05 | Github |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | 2023-05-24 | Github |
| PaLM-E: An Embodied Multimodal Language Model | 2023-03-06 | - |

Datasets & Benchmarks

| Title | Date | Code |
|-------|------|------|
| OpenEQA: Embodied Question Answering in the Era of Foundation Models | 2024-06-17 | Github |
| PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI | 2024-04-15 | Github |
| EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | 2023-12-26 | Github |
| Holodeck: Language Guided Generation of 3D Embodied AI Environments | 2023-12-14 | Github |
| Learning Interactive Real-World Simulators | 2023-10-09 | - |
| Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | 2023-10-03 | Github |