This is the partner repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope it can serve as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods. The organization of this readme follows Figure 1 in the paper (shown above): it is divided into foundation models that have been applied to robotics and those that are relevant to robotics.
We welcome contributions to this repository to add more resources. Please submit a pull request if you want to contribute!
- Survey
- Robotics
- Neural Scaling Laws for Embodied AI
- Robot Policy Learning for Decision Making and Controls
- Language-Image Goal-Conditioned Value Learning
- Robot Task Planning Using Large Language Models
- Robot Transformers
- In-context Learning for Decision-Making
- Open-Vocabulary Robot Navigation and Manipulation
- Relevant to Robotics
- Open-Vocabulary Object Detection and 3D Classification
- Open-Vocabulary Semantic Segmentation
- Open-Vocabulary 3D Scene Representations
- Open-Vocabulary Object Representations
- Affordance Information
- Predictive Models
- Generalist AI
- Simulators
This repository is largely based on the following paper:
Foundation Models in Robotics: Applications, Challenges, and the Future
Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, Mac Schwager
If you find this repository helpful, please consider citing:
@article{firoozi2023foundation,
title={Foundation Models in Robotics: Applications, Challenges, and the Future},
author={Firoozi, Roya and Tucker, Johnathan and Tian, Stephen and Majumdar, Anirudha and Sun, Jiankai and Liu, Weiyu and Zhu, Yuke and Song, Shuran and Kapoor, Ashish and Hausman, Karol and others},
journal={arXiv preprint arXiv:2312.07843},
year={2023}
}
- Neural Scaling Laws for Embodied AI [Paper]
- CLIPort: What and Where Pathways for Robotic Manipulation [Paper][Project][Code]
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [Paper][Project][Code]
- Play-LMP: Learning Latent Plans from Play [Project]
- Multi-Context Imitation: Language-Conditioned Imitation Learning over Unstructured Data [Project]
- Towards A Unified Agent with Foundation Models [Paper]
- Reward Design with Language Models [Paper]
- Learning to Generate Better Than Your LLM [Paper][Code]
- Guiding Pretraining in Reinforcement Learning with Large Language Models [Paper][Code]
- Motif: Intrinsic Motivation from Artificial Intelligence Feedback [Paper][Code]
- SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [Paper][Project][Code]
- Zero-Shot Reward Specification via Grounded Natural Language [Paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [Project]
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [Paper][Project]
- LIV: Language-Image Representations and Rewards for Robotic Control [Paper][Project]
- LOReL: Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [Paper][Project]
- Text2Motion: From Natural Language Instructions to Feasible Plans [Paper][Project]
- MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [Paper][Project][Code]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Paper][Project]
- Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [Paper][Project]
- NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models [Paper][Project][Code]
- AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers [Paper][Project]
- LATTE: LAnguage Trajectory TransformEr [Paper][Code]
- Planning with Large Language Models via Corrective Re-prompting [Paper]
- Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [Paper][Code]
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [Paper][Project][Code]
- An Embodied Generalist Agent in 3D World [Paper][Project][Code]
- LLM+P: Empowering Large Language Models with Optimal Planning Proficiency [Paper][Code]
- MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [Paper][Project][Code]
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [Paper][Project]
- Code as Policies: Language Model Programs for Embodied Control [Paper][Project]
- ChatGPT for Robotics: Design Principles and Model Abilities [Paper][Project][Code]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [Paper][Project]
- Visual Programming: Compositional visual reasoning without training [Paper][Project][Code]
- Deploying and Evaluating LLMs to Program Service Mobile Robots [Paper][Project][Code]
- MotionGPT: Finetuned LLMs are General-Purpose Motion Generators [Paper][Project]
- RT-1: Robotics Transformer for Real-World Control at Scale [Paper][Project][Code]
- Masked Visual Pre-training for Motor Control [Paper][Project][Code]
- Real-World Robot Learning with Masked Visual Pre-training [Paper][Project]
- R3M: A Universal Visual Representation for Robot Manipulation [Paper][Project][Code]
- Robot Learning with Sensorimotor Pre-training [Paper][Project]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [Paper][Project]
- PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [Paper]
- GROOT: Learning to Follow Instructions by Watching Gameplay Videos [Paper][Project][Code]
- Behavior Transformers (BeT): Cloning k modes with one stone [Paper][Project][Code]
- Conditional Behavior Transformers (C-BeT), From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data [Paper][Project][Code]
- MAGICVFM: Meta-learning Adaptation for Ground Interaction Control with Visual Foundation Models [Paper]
- A Survey on In-context Learning [Paper]
- Large Language Models as General Pattern Machines [Paper]
- Chain-of-Thought Predictive Control [Paper]
- ReAct: Synergizing Reasoning and Acting in Language Models [Paper]
- ICRT: In-Context Imitation Learning via Next-Token Prediction [Paper][Project][Code]
- CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation [Paper][Project][Code]
- Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [Paper][Project]
- LSC: Language-guided Skill Coordination for Open-Vocabulary Mobile Pick-and-Place [Paper][Project]
- L3MVN: Leveraging Large Language Models for Visual Target Navigation [Project]
- Open-World Object Manipulation using Pre-trained Vision-Language Models [Paper][Project]
- VIMA: General Robot Manipulation with Multimodal Prompts [Paper][Project][Code]
- Diffusion-based Generation, Optimization, and Planning in 3D Scenes [Paper][Project][Code]
- LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery [Paper][Project]
- Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World [Paper][Project]
- ThinkBot: Embodied Instruction Following with Thought Chain Reasoning [Paper][Project]
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [Paper][Project][Code]
- USA-Net: Unified Semantic and Affordance Representations for Robot Memory [Paper][Project][Code]
- Simple Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
- Grounded Language-Image Pre-training [Paper][Code]
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Paper][Code]
- PointCLIP: Point Cloud Understanding by CLIP [Paper][Code]
- Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [Paper][Code]
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [Paper][Project][Code]
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [Paper][Code]
- 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [Paper][Project][Code]
- Language-driven Semantic Segmentation [Paper][Code]
- Emerging Properties in Self-Supervised Vision Transformers [Paper][Code]
- Segment Anything [Paper][Project]
- Fast Segment Anything [Paper][Code]
- Faster Segment Anything: Towards Lightweight SAM for Mobile Applications [Paper][Code]
- Track Anything: Segment Anything Meets Videos [Paper][Code]
- Open-vocabulary Queryable Scene Representations for Real World Planning (NLMap) [Paper][Project]
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [Paper][Project]
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [Paper][Project][Code]
- LERF: Language Embedded Radiance Fields [Paper][Project][Code]
- Decomposing NeRF for Editing via Feature Field Distillation [Paper][Project]
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [Paper][Project]
- BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects [Paper][Project]
- Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation [Paper][Project]
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation [Paper][Project]
- You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example [Paper]
- Zero-Shot Category-Level Object Pose Estimation [Paper][Code]
- VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors [Paper][Project][Code]
- Learning Generalizable Manipulation Policies with Object-Centric 3D Representations [Paper][Project][Code]
- Affordance Diffusion: Synthesizing Hand-Object Interactions [Paper][Project]
- Affordances from Human Videos as a Versatile Representation for Robotics [Paper][Project]
- Adversarial Inverse Reinforcement Learning With Self-Attention Dynamics Model [Paper]
- Connected Autonomous Vehicle Motion Planning with Video Predictions from Smart, Self-Supervised Infrastructure [Paper]
- Self-Supervised Traffic Advisors: Distributed, Multi-view Traffic Prediction for Smart Cities [Paper]
- Planning with Diffusion for Flexible Behavior Synthesis [Paper]
- Phenaki: Variable-Length Video Generation from Open Domain Textual Description [Paper]
- RoboNet: Large-Scale Multi-Robot Learning [Paper]
- GAIA-1: A Generative World Model for Autonomous Driving [Paper]
- Learning Universal Policies via Text-Guided Video Generation [Paper]
- Video Language Planning [Paper]
- MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [Paper][Project][Code]
- Inner Monologue: Embodied Reasoning through Planning with Language Models [Paper][Project]
- Statler: State-Maintaining Language Models for Embodied Reasoning [Paper][Project]
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [Paper][Project]
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [Paper][Code]
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [Paper]
- Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction [Paper][Code]
- Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [Paper][Code]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [Paper][Project][Code]
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [Paper][Project]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Paper][Project][Code]
- GROOT: Learning to Follow Instructions by Watching Gameplay Videos [Paper][Project][Code]
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [Paper][Project][Code]
- SQA3D: Situated Question Answering in 3D Scenes [Paper][Project][Code]
- MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [Paper][Project][Code]
- MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [Paper][Project][Code]
- Generative Agents: Interactive Simulacra of Human Behavior [Paper]
- Towards Generalist Robots: A Promising Paradigm via Generative Simulation [Paper]
- A Generalist Agent [Paper]
- An Embodied Generalist Agent in 3D World [Paper][Project][Code]
- Gibson Env: Real-World Perception for Embodied Agents [Paper]
- iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [Paper][Project]
- BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [Paper][Project]
- Habitat: A Platform for Embodied AI Research [Paper][Project]
- Habitat 2.0: Training Home Assistants to Rearrange their Habitat [Paper]
- RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [Paper][Project]
- VirtualHome: Simulating Household Activities via Programs [Paper]
- ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes [Paper][Project][Code]
- ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks [Paper][Project][Code]
- LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning [Paper][Project][Code]
- ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [Paper][Project][Code]