Awesome-Embodied-AI

A curated paper list on Embodied AI with Foundation Models

MIT License


Contributors: todo

Survey

  • Foundation Models in Robotics: Applications, Challenges, and the Future [paper]
  • Foundation Models for Decision Making: Problems, Methods, and Opportunities [paper]

Large Language Models (LLMs)

  • Awesome-LLM [project]
  • GPT-3: Language Models are Few-Shot Learners [paper]
  • GPT-4: GPT-4 Technical Report [project]
  • LLaMA: Open and Efficient Foundation Language Models [paper]
  • Llama 2: Open Foundation and Fine-Tuned Chat Models [paper]
  • Mistral 7B [paper]

Vision-Language Models (VLMs)

Image-Language Models

  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [paper]
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [paper]
  • CLIP: Learning Transferable Visual Models From Natural Language Supervision [paper] (a minimal zero-shot matching sketch follows this list)
  • Visual Instruction Tuning [paper]
  • Improved Baselines with Visual Instruction Tuning [paper]
  • Flamingo: a Visual Language Model for Few-Shot Learning [paper]
  • LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention [paper]
  • PandaGPT: One Model To Instruction-Follow Them All [paper]
  • OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models [paper]
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [paper]
  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [paper]
  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [paper]
  • ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper]
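
As a quick illustration of how an image-language backbone such as CLIP is typically queried, here is a minimal zero-shot image-text matching sketch. It assumes the Hugging Face transformers package and the openai/clip-vit-base-patch32 checkpoint, neither of which is prescribed by the papers above, and the image path is hypothetical.

```python
# Hedged sketch: zero-shot image-text matching with a CLIP checkpoint.
# Assumes `pip install transformers pillow torch`; checkpoint and paths are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical example image
texts = ["a robot arm picking up a mug", "an empty kitchen counter"]

# Tokenize the captions and preprocess the image into a single batch of tensors.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print({t: round(p.item(), 3) for t, p in zip(texts, probs[0])})
```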

Video-Language Models

  • Learning Video Representations from Large Language Models [paper]
  • VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset [paper]
  • Otter: A Multi-Modal Model with In-Context Instruction Tuning [paper]
  • Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [paper]
  • Valley: Video Assistant with Large Language model Enhanced abilitY [paper]
  • Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [paper]
  • World Model on Million-Length Video And Language With Blockwise RingAttention [paper]
  • Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [paper]
  • LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models [paper]
  • VideoChat: Chat-Centric Video Understanding [paper]
  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [paper]
  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [paper]
  • Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [paper]
  • PG-Video-LLaVA: Pixel Grounding Large Video-Language Models [paper]
  • GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation [paper]

Simulators

  • VirtualHome: Simulating Household Activities via Programs [paper]
  • Gibson Env: Real-World Perception for Embodied Agents [paper]
  • iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [paper]
  • Habitat: A Platform for Embodied AI Research [paper]
  • Habitat 2.0: Training Home Assistants to Rearrange their Habitat [paper]
  • Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots [paper]
  • AI2-THOR: An Interactive 3D Environment for Visual AI [paper]
  • RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [paper]
  • BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [paper]
  • ThreeDWorld: A High-Fidelity, Multi-Modal Platform for Interactive Physical Simulation [paper]
  • LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning [paper]
  • ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [paper]
  • PyBullet: Physics simulation for games, visual effects, robotics and reinforcement learning [paper] (a minimal usage sketch follows this list)
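
For hands-on experimentation, PyBullet (the last entry above) is the lightest-weight starting point among these simulators. Below is a minimal, hedged sketch of a simulation loop; it assumes only the pybullet and pybullet_data packages, and the KUKA iiwa URDF it loads is just one of PyBullet's bundled example assets, chosen for illustration.

```python
# Hedged sketch: a bare-bones PyBullet simulation loop.
# Assumes `pip install pybullet`; the URDFs come bundled with pybullet_data.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics server; use p.GUI for a visual window
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane = p.loadURDF("plane.urdf")
robot = p.loadURDF("kuka_iiwa/model.urdf", basePosition=[0, 0, 0])

# Step the simulation for one second at PyBullet's default 240 Hz timestep.
for _ in range(240):
    p.stepSimulation()

print(p.getBasePositionAndOrientation(robot))
p.disconnect()
```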

Video Data

  • Scaling Egocentric Vision: The EPIC-KITCHENS Dataset [paper]
  • Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 [paper]
  • Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [paper]
  • Ego4D: Around the World in 3,000 Hours of Egocentric Video [paper]
  • Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos [paper]
  • Delving into Egocentric Actions [paper]

Egocentric

  • Ego-Topo: Environment Affordances From Egocentric Video [paper]

High-Resolution

  • OtterHD: A High-Resolution Multi-modality Model [paper]

EAI with Foundation Models

  • 3D-LLM: Injecting the 3D World into Large Language Models [paper]
  • Reward Design with Language Models [paper]
  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [paper]
  • Inner Monologue: Embodied Reasoning through Planning with Language Models [paper]
  • Text2Motion: From Natural Language Instructions to Feasible Plans [paper]
  • VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper]
  • ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [paper]
  • Code as Policies: Language Model Programs for Embodied Control [paper]
  • ChatGPT for Robotics: Design Principles and Model Abilities [paper]
  • LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [paper]
  • Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments [paper]
  • L3MVN: Leveraging Large Language Models for Visual Target Navigation [paper]
  • HomeRobot: Open-Vocabulary Mobile Manipulation [paper]
  • RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation [paper]
  • Statler: State-Maintaining Language Models for Embodied Reasoning [paper]
  • Collaborating with language models for embodied reasoning [paper]
  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper]
  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [paper]
  • Voyager: An Open-Ended Embodied Agent with Large Language Models [paper]
  • Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [paper]
  • Guiding Pretraining in Reinforcement Learning with Large Language Models [paper]
  • Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [paper]

Embodied Multi-modal Language Models

Representation Learning

  • Language-Driven Representation Learning for Robotics [paper]
  • R3M: A Universal Visual Representation for Robot Manipulation [paper]
  • VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [paper]
  • LIV: Language-Image Representations and Rewards for Robotic Control [paper]
  • Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [paper]
  • DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning [paper]

End-to-End

  • Masked Visual Pre-training for Motor Control [paper]
  • Real-World Robot Learning with Masked Visual Pre-training [paper]
  • RT-1: Robotics Transformer for Real-World Control at Scale [paper]
  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper]
  • Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper]
  • PaLM-E: An Embodied Multimodal Language Model [paper]
  • PaLI-X: On Scaling up a Multilingual Vision and Language Model [paper]
  • A Generalist Agent [paper]

Benchmarks