Awesome-Embodied-AI

A curated paper list on Embodied AI with Foundation Models

MIT License


Contributors: todo

Survey

  • Foundation Models in Robotics: Applications, Challenges, and the Future [paper]
  • Foundation Models for Decision Making: Problems, Methods, and Opportunities [paper]

Large Language Models (LLMs)

  • Awesome-LLM [project]
  • GPT-3: Language Models are Few-Shot Learners [paper]
  • GPT-4: GPT-4 Technical Report [project]
  • LLaMA: Open and Efficient Foundation Language Models [paper]
  • Llama 2: Open Foundation and Fine-Tuned Chat Models [paper]
  • Mistral 7B [paper]

Vision-Language Models (VLMs)

Image-Language Models

  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [paper]
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [paper]
  • CLIP: Learning Transferable Visual Models From Natural Language Supervision [paper] (a minimal zero-shot matching sketch follows this list)
  • Visual Instruction Tuning [paper]
  • Improved Baselines with Visual Instruction Tuning [paper]
  • Flamingo: a Visual Language Model for Few-Shot Learning [paper]
  • LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention [paper]
  • PandaGPT: One Model To Instruction-Follow Them All [paper]
  • OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models [paper]
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [paper]
  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [paper]
  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [paper]
  • ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper]
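
As a quick illustration of how an image-language backbone such as CLIP is typically queried, here is a minimal zero-shot image-text matching sketch. It assumes the Hugging Face transformers package and the openai/clip-vit-base-patch32 checkpoint, neither of which is prescribed by the papers above, and the image path is hypothetical.

```python
# Hedged sketch: zero-shot image-text matching with a CLIP checkpoint.
# Assumes `pip install transformers pillow torch`; checkpoint and paths are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical example image
texts = ["a robot arm picking up a mug", "an empty kitchen counter"]

# Tokenize the captions and preprocess the image into a single batch of tensors.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print({t: round(p.item(), 3) for t, p in zip(texts, probs[0])})
```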

Video-Language Models

  • Learning Video Representations from Large Language Models [paper]
  • VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset [paper]
  • Otter: A Multi-Modal Model with In-Context Instruction Tuning [paper]
  • Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [paper]
  • Valley: Video Assistant with Large Language model Enhanced abilitY [paper]
  • Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [paper]
  • World Model on Million-Length Video And Language With Blockwise RingAttention [paper]
  • Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [paper]
  • LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models [paper]
  • VideoChat: Chat-Centric Video Understanding [paper]
  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [paper]
  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [paper]
  • Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [paper]
  • PG-Video-LLaVA: Pixel Grounding Large Video-Language Models [paper]
  • GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation [paper]

Simulators

  • VirtualHome: Simulating Household Activities via Programs [paper]
  • Gibson Env: Real-World Perception for Embodied Agents [paper]
  • iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [paper]
  • Habitat: A Platform for Embodied AI Research [paper]
  • Habitat 2.0: Training Home Assistants to Rearrange their Habitat [paper]
  • Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots [paper]
  • AI2-THOR: An Interactive 3D Environment for Visual AI [paper]
  • RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [paper]
  • BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [paper]
  • ThreeDWorld: A High-Fidelity, Multi-Modal Platform for Interactive Physical Simulation [paper]
  • LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning [paper]
  • ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [paper]
  • PyBullet: Physics simulation for games, visual effects, robotics and reinforcement learning [paper] (a minimal usage sketch follows this list)
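
For hands-on experimentation, PyBullet (the last entry above) is the lightest-weight starting point among these simulators. Below is a minimal, hedged sketch of a simulation loop; it assumes only the pybullet and pybullet_data packages, and the KUKA iiwa URDF it loads is just one of PyBullet's bundled example assets, chosen for illustration.

```python
# Hedged sketch: a bare-bones PyBullet simulation loop.
# Assumes `pip install pybullet`; the URDFs come bundled with pybullet_data.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics server; use p.GUI for a visual window
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane = p.loadURDF("plane.urdf")
robot = p.loadURDF("kuka_iiwa/model.urdf", basePosition=[0, 0, 0])

# Step the simulation for one second at PyBullet's default 240 Hz timestep.
for _ in range(240):
    p.stepSimulation()

print(p.getBasePositionAndOrientation(robot))
p.disconnect()
```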

Video Data

  • Scaling Egocentric Vision: The EPIC-KITCHENS Dataset [paper]
  • Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 [paper]
  • Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [paper]
  • Ego4D: Around the World in 3,000 Hours of Egocentric Video [paper]
  • Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos [paper]
  • Delving into Egocentric Actions [paper]

Egocentric

  • Ego-Topo: Environment Affordances From Egocentric Video [paper]

High-Resolution

  • OtterHD: A High-Resolution Multi-modality Model [paper]

EAI with Foundation Models

  • 3D-LLM: Injecting the 3D World into Large Language Models [paper]
  • Reward Design with Language Models [paper]
  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [paper]
  • Inner Monologue: Embodied Reasoning through Planning with Language Models [paper]
  • Text2Motion: From Natural Language Instructions to Feasible Plans [paper]
  • VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper]
  • ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [paper]
  • Code as Policies: Language Model Programs for Embodied Control [paper]
  • ChatGPT for Robotics: Design Principles and Model Abilities [paper]
  • LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [paper]
  • Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments [paper]
  • L3MVN: Leveraging Large Language Models for Visual Target Navigation [paper]
  • HomeRobot: Open-Vocabulary Mobile Manipulation [paper]
  • RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation [paper]
  • Statler: State-Maintaining Language Models for Embodied Reasoning [paper]
  • Collaborating with language models for embodied reasoning [paper]
  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper]
  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [paper]
  • Voyager: An Open-Ended Embodied Agent with Large Language Models [paper]
  • Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [paper]
  • Guiding Pretraining in Reinforcement Learning with Large Language Models [paper]
  • Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [paper]

Embodied Multi-modal Language Models

Representation Learning

  • Language-Driven Representation Learning for Robotics [paper]
  • R3M: A Universal Visual Representation for Robot Manipulation [paper]
  • VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [paper]
  • LIV: Language-Image Representations and Rewards for Robotic Control [paper]
  • Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [paper]
  • DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning [paper]

End-to-End

  • Masked Visual Pre-training for Motor Control [paper]
  • Real-World Robot Learning with Masked Visual Pre-training [paper]
  • RT-1: Robotics Transformer for Real-World Control at Scale [paper]
  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper]
  • Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper]
  • PaLM-E: An Embodied Multimodal Language Model [paper]
  • PaLI-X: On Scaling up a Multilingual Vision and Language Model [paper]
  • A Generalist Agent [paper]

Benchmarks