Awesome-Robotics-Foundation-Models

This is the partner repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope this repository can act as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods. The organization of this readme follows Figure 1 in the paper (shown above) and is thus divided into foundation models that have been applied to robotics and those that are relevant to robotics in some way.

We welcome contributions to this repository to add more resources. Please submit a pull request if you want to contribute!

Survey
Robotics
Robot Policy Learning for Decision-Making and Controls
Language-Image Goal-Conditioned Value Learning
Robot Task Planning Using Large Language Models
Robot Transformers
In-context Learning for Decision-Making
Open-Vocabulary Robot Navigation and Manipulation
Relevant to Robotics
Open-Vocabulary Object Detection and 3D Classification
Open-Vocabulary Semantic Segmentation
Open-Vocabulary 3D Scene Representations
Open-Vocabulary Object Representations
Affordance Information
Predictive Models
Generalist AI
Simulators

Survey

This repository is largely based on the following paper:

Foundation Models in Robotics: Applications, Challenges, and the Future
Roya Firoozi, Jiankai Sun, Johnathan Tucker, Anirudha Majumdar, Yuke Zhu, Shuran Song, Ashish Kapoor, Weiyu Liu, Stephen Tian, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, Mac Schwager

If you find this repository helpful, please consider citing:

Robotics

Robot Policy Learning for Decision-Making and Controls

Language-Conditioned Imitation Learning

CLIPort: What and Where Pathways for Robotic Manipulation [Paper][Project][Code]
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [Paper][Project][Code]
Play-LMP: Learning Latent Plans from Play [Project]
Multi-Context Imitation: Language-Conditioned Imitation Learning over Unstructured Data [Project]

Language-Assisted Reinforcement Learning

Towards A Unified Agent with Foundation Models [Paper]
Reward Design with Language Models [Paper]
Learning to Generate Better Than Your LLM [Paper][Code]
Guiding Pretraining in Reinforcement Learning with Large Language Models [Paper][Code]

Language-Image Goal-Conditioned Value Learning

SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [Paper][Project][Code]
Zero-Shot Reward Specification via Grounded Natural Language [Paper]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [Project]
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [Paper][Project]
LIV: Language-Image Representations and Rewards for Robotic Control [Paper][Project]
LOReL: Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [Paper][Project]
Text2Motion: From Natural Language Instructions to Feasible Plans [Paper][Project]

Robot Task Planning Using Large Language Models

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Paper][Project]
NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models [Paper][Project][Code]
AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers [Paper][Project]
LATTE: LAnguage Trajectory TransformEr [Paper][Code]
Planning with Large Language Models via Corrective Re-prompting [Paper]
Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [Paper][Code]
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [Paper][Project][Code]
An Embodied Generalist Agent in 3D World [Paper][Project][Code]

LLM-Based Code Generation

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [Paper][Project]
Code as Policies: Language Model Programs for Embodied Control [Paper][Project]
ChatGPT for Robotics: Design Principles and Model Abilities [Paper][Project][Code]
Voyager: An Open-Ended Embodied Agent with Large Language Models [Paper][Project]
Visual Programming: Compositional Visual Reasoning without Training [Paper][Project][Code]

Robot Transformers

MotionGPT: Finetuned LLMs are General-Purpose Motion Generators [Paper][Project]
RT-1: Robotics Transformer for Real-World Control at Scale [Paper][Project][Code]
Masked Visual Pre-training for Motor Control [Paper][Project][Code]
Real-World Robot Learning with Masked Visual Pre-training [Paper][Project]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [Paper][Project]
PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [Paper]
GROOT: Learning to Follow Instructions by Watching Gameplay Videos [Paper][Project][Code]

In-context Learning for Decision-Making

A Survey on In-context Learning [Paper]
Large Language Models as General Pattern Machines [Paper]
Chain-of-Thought Predictive Control [Paper]
ReAct: Synergizing Reasoning and Acting in Language Models [Paper]

Open-Vocabulary Robot Navigation and Manipulation

CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation [Paper][Project][Code]
LSC: Language-guided Skill Coordination for Open-Vocabulary Mobile Pick-and-Place [Paper][Project]
L3MVN: Leveraging Large Language Models for Visual Target Navigation [Project]
Open-World Object Manipulation using Pre-trained Vision-Language Models [Paper][Project]
VIMA: General Robot Manipulation with Multimodal Prompts [Paper][Project][Code]
Diffusion-based Generation, Optimization, and Planning in 3D Scenes [Paper][Project][Code]

Relevant to Robotics (Perception)

Open-Vocabulary Object Detection and 3D Classification

Simple Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
Grounded Language-Image Pre-training [Paper][Code]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Paper][Code]
PointCLIP: Point Cloud Understanding by CLIP [Paper][Code]
Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [Paper][Code]
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [Paper][Project][Code]
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [Paper][Code]
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [Paper][Project][Code]

Open-Vocabulary Semantic Segmentation

Language-driven Semantic Segmentation [Paper][Code]
Emerging Properties in Self-Supervised Vision Transformers [Paper][Code]
Segment Anything [Paper][Project]
Fast Segment Anything [Paper][Code]
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications [Paper][Code]
Track Anything: Segment Anything Meets Videos [Paper][Code]

Open-Vocabulary 3D Scene Representations

CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [Paper][Project]
LERF: Language Embedded Radiance Fields [Paper][Project][Code]
Decomposing NeRF for Editing via Feature Field Distillation [Paper][Project]

Object Representations

Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation [Paper][Project]
Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation [Paper][Project]
You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example [Paper]
Zero-Shot Category-Level Object Pose Estimation [Paper][Code]

Affordance Information

Affordance Diffusion: Synthesizing Hand-Object Interactions [Paper][Project]
Affordances from Human Videos as a Versatile Representation for Robotics [Paper][Project]

Predictive Models

Adversarial Inverse Reinforcement Learning With Self-Attention Dynamics Model [Paper]
Connected Autonomous Vehicle Motion Planning with Video Predictions from Smart, Self-Supervised Infrastructure [Paper]
Self-Supervised Traffic Advisors: Distributed, Multi-view Traffic Prediction for Smart Cities [Paper]
Planning with Diffusion for Flexible Behavior Synthesis [Paper]
Phenaki: Variable-Length Video Generation from Open Domain Textual Description [Paper]
RoboNet: Large-Scale Multi-Robot Learning [Paper]
GAIA-1: A Generative World Model for Autonomous Driving [Paper]
Learning Universal Policies via Text-Guided Video Generation [Paper]
Video Language Planning [Paper]

Relevant to Robotics (Embodied AI)

Inner Monologue: Embodied Reasoning through Planning with Language Models [Paper][Project]
Statler: State-Maintaining Language Models for Embodied Reasoning [Paper][Project]
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [Paper][Project]
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [Paper][Code]
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos [Paper]
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction [Paper][Code]
Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents [Paper][Code]
Voyager: An Open-Ended Embodied Agent with Large Language Models [Paper][Project][Code]
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [Paper][Project]
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Paper][Project][Code]
GROOT: Learning to Follow Instructions by Watching Gameplay Videos [Paper][Project][Code]
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [Paper][Project][Code]
SQA3D: Situated Question Answering in 3D Scenes [Paper][Project][Code]

Generalist AI

Generative Agents: Interactive Simulacra of Human Behavior [Paper]
Towards Generalist Robots: A Promising Paradigm via Generative Simulation [Paper]
A Generalist Agent [Paper]
An Embodied Generalist Agent in 3D World [Paper][Project][Code]

Simulators

Gibson Env: Real-World Perception for Embodied Agents [Paper]
iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [Paper][Project]
BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [Paper][Project]
Habitat: A Platform for Embodied AI Research [Paper][Project]
Habitat 2.0: Training Home Assistants to Rearrange Their Habitat [Paper]
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [Paper][Project]
VirtualHome: Simulating Household Activities via Programs [Paper]
ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes [Paper][Project][Code]
