This is the partner repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope this repository can act as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods. The organization of this readme follows Figure 1 in the paper (shown above) and is thus divided into foundation models that have been applied to robotics and those that are relevant to robotics in some way.
We welcome contributions to this repository to add more resources. Please submit a pull request if you want to contribute!
- Survey
- Robotics
  - Robot Policy Learning for Decision Making and Controls
  - Language-Image Goal-Conditioned Value Learning
  - Robot Task Planning Using Large Language Models
  - Robot Transformers
  - In-context Learning for Decision-Making
  - Open-Vocabulary Robot Navigation and Manipulation
- Relevant to Robotics
  - Open-Vocabulary Object Detection and 3D Classification
  - Open-Vocabulary Semantic Segmentation
  - Open-Vocabulary 3D Scene Representations
  - Open-Vocabulary Object Representations
  - Affordance Information
  - Predictive Models
  - Generalist AI
  - Simulators
This repository is largely based on the following paper:
Foundation Models in Robotics: Applications, Challenges, and the Future
Roya Firoozi,
Jiankai Sun,
Johnathan Tucker,
Anirudha Majumdar,
Yuke Zhu,
Shuran Song,
Ashish Kapoor,
Weiyu Liu,
Stephen Tian,
Karol Hausman,
Brian Ichter,
Danny Driess,
Jiajun Wu,
Cewu Lu,
Mac Schwager
If you find this repository helpful, please consider citing the paper above.
- CLIPort: What and Where Pathways for Robotic Manipulation [Paper][Project][Code]
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [Paper][Project][Code]
- Play-LMP: Learning Latent Plans from Play [Project]
- Multi-Context Imitation: Language-Conditioned Imitation Learning over Unstructured Data [Project]
- Towards A Unified Agent with Foundation Models [Paper]
- Reward Design with Language Models [Paper]
- Learning to Generate Better Than Your LLM [Paper][Code]
- Guiding Pretraining in Reinforcement Learning with Large Language Models [Paper][Code]
- SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [Paper][Project][Code]
- Zero-Shot Reward Specification via Grounded Natural Language [Paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [Project]
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [Paper][Project]
- LIV: Language-Image Representations and Rewards for Robotic Control [Paper][Project]
- LOReL: Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [Paper][Project]
- Text2Motion: From Natural Language Instructions to Feasible Plans [Paper][Project]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Paper][Project]
- NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models [Paper][Project][Code]
- AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers [Paper][Project]
- LATTE: LAnguage Trajectory TransformEr [Paper][Code]
- Planning with Large Language Models via Corrective Re-prompting [Paper]
- Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [Paper][Code]
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [Paper][Project][Code]
- An Embodied Generalist Agent in 3D World [Paper][Project][Code]
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [Paper][Project]
- Code as Policies: Language Model Programs for Embodied Control [Paper][Project]
- ChatGPT for Robotics: Design Principles and Model Abilities [Paper][Project][Code]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [Paper][Project]
- Visual Programming: Compositional visual reasoning without training [Paper][Project][Code]
- MotionGPT: Finetuned LLMs are General-Purpose Motion Generators [Paper][Project]
- RT-1: Robotics Transformer for Real-World Control at Scale [Paper][Project][Code]
- Masked Visual Pre-training for Motor Control [Paper][Project][Code]
- Real-World Robot Learning with Masked Visual Pre-training [Paper][Project]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [Paper][Project]
- PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [Paper]
- GROOT: Learning to Follow Instructions by Watching Gameplay Videos [Paper][Project][Code]
- A Survey on In-context Learning [Paper]
- Large Language Models as General Pattern Machines [Paper]
- Chain-of-Thought Predictive Control [Paper]
- ReAct: Synergizing Reasoning and Acting in Language Models [Paper]
- CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation [Paper][Project][Code]
- LSC: Language-guided Skill Coordination for Open-Vocabulary Mobile Pick-and-Place [Paper][Project]
- L3MVN: Leveraging Large Language Models for Visual Target Navigation [Project]
- Open-World Object Manipulation using Pre-trained Vision-Language Models [Paper][Project]
- VIMA: General Robot Manipulation with Multimodal Prompts [Paper][Project][Code]
- Diffusion-based Generation, Optimization, and Planning in 3D Scenes [Paper][Project][Code]
- Simple Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
- Grounded Language-Image Pre-training [Paper][Code]
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Paper][Code]
- PointCLIP: Point Cloud Understanding by CLIP [Paper][Code]
- Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [Paper][Code]
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [Paper][Project][Code]
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [Paper][Code]
- 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [Paper][Project][Code]
- Language-driven Semantic Segmentation [Paper][Code]
- Emerging Properties in Self-Supervised Vision Transformers [Paper][Code]
- Segment Anything [Paper][Project]
- Fast Segment Anything [Paper][Code]
- Faster Segment Anything: Towards Lightweight SAM for Mobile Applications [Paper][Code]
- Track Anything: Segment Anything Meets Videos [Paper][Code]
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [Paper][Project]
- LERF: Language Embedded Radiance Fields [Paper][Project][Code]
- Decomposing NeRF for Editing via Feature Field Distillation [Paper][Project]
- Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation [Paper][Project]
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation [Paper][Project]
- You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example [Paper]
- Zero-Shot Category-Level Object Pose Estimation [Paper][Code]
- Affordance Diffusion: Synthesizing Hand-Object Interactions [Paper][Project]
- Affordances from Human Videos as a Versatile Representation for Robotics [Paper][Project]
- Adversarial Inverse Reinforcement Learning With Self-Attention Dynamics Model [Paper]
- Connected Autonomous Vehicle Motion Planning with Video Predictions from Smart, Self-Supervised Infrastructure [Paper]
- Self-Supervised Traffic Advisors: Distributed, Multi-view Traffic Prediction for Smart Cities [Paper]
- Planning with Diffusion for Flexible Behavior Synthesis [Paper]
- Phenaki: Variable-Length Video Generation from Open Domain Textual Description [Paper]
- RoboNet: Large-Scale Multi-Robot Learning [Paper]
- GAIA-1: A Generative World Model for Autonomous Driving [Paper]
- Learning Universal Policies via Text-Guided Video Generation [Paper]
- Video Language Planning [Paper]
- Inner Monologue: Embodied Reasoning through Planning with Language Models [Paper][Project]
- Statler: State-Maintaining Language Models for Embodied Reasoning [Paper][Project]
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [Paper][Project]
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [Paper][Code]
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [Paper]
- Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction [Paper][Code]
- Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [Paper][Code]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [Paper][Project][Code]
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [Paper][Project]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Paper][Project][Code]
- GROOT: Learning to Follow Instructions by Watching Gameplay Videos [Paper][Project][Code]
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [Paper][Project][Code]
- SQA3D: Situated Question Answering in 3D Scenes [Paper][Project][Code]
- Generative Agents: Interactive Simulacra of Human Behavior [Paper]
- Towards Generalist Robots: A Promising Paradigm via Generative Simulation [Paper]
- A Generalist Agent [Paper]
- An Embodied Generalist Agent in 3D World [Paper][Project][Code]
- Gibson Env: Real-World Perception for Embodied Agents [Paper]
- iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [Paper][Project]
- BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [Paper][Project]
- Habitat: A Platform for Embodied AI Research [Paper][Project]
- Habitat 2.0: Training Home Assistants to Rearrange their Habitat [Paper]
- RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [Paper][Project]
- VirtualHome: Simulating Household Activities via Programs [Paper]
- ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes [Paper][Project][Code]