# Awesome Multi-Modal Reinforcement Learning
This is a collection of research papers on multi-modal reinforcement learning (MMRL). The repository will be continuously updated to track the frontier of MMRL. Some papers may not be strictly about RL, but we include them anyway because they may be useful for MMRL research.
Feel free to follow and star!
## Introduction
Multi-modal RL agents learn from video (images), language (text), or both, much as humans do. We believe it is important for intelligent agents to learn directly from images and text, since such data can be easily obtained from the Internet.
## Papers
format:
- [title](paper link) [links]
  - authors. venue
  - key words
  - experiment environment
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
  - Linxi Fan, Guanzhi Wang, Yunfan Jiang, et al. ArXiv2022
  - Key Words: multimodal dataset, MineCLIP (see the sketch below)
  - ExpEnv: Minecraft
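MineCLIP's core idea is to score how well a short video clip matches a natural-language task prompt and to use that score as a dense reward. A minimal sketch follows; the encoders, pooling, and dimensions here are illustrative stand-ins, not the released MineCLIP model:

```python
# Minimal sketch of a MineCLIP-style video-text similarity reward.
# The encoders below are hypothetical stand-ins, not the released model.
import torch
import torch.nn.functional as F

class VideoTextReward(torch.nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Stand-in projections; MineCLIP uses a frame-wise CLIP image
        # encoder with temporal aggregation and a text transformer.
        self.video_proj = torch.nn.Linear(768, dim)
        self.text_proj = torch.nn.Linear(768, dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, 768) per-frame features; mean-pool over time.
        v = F.normalize(self.video_proj(video_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        # Cosine similarity between clip and task prompt, used as reward.
        return (v * t).sum(dim=-1)

reward_fn = VideoTextReward()
video = torch.randn(1, 16, 768)  # features of 16 frames
text = torch.randn(1, 768)       # embedding of e.g. "shear a sheep"
print(reward_fn(video, text))    # one scalar reward per batch element
```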
- R3M: A Universal Visual Representation for Robot Manipulation
  - Suraj Nair, Aravind Rajeswaran, Vikash Kumar, et al. ArXiv2022
  - Key Words: Ego4D human video dataset, pre-trained visual representation (see the sketch below)
  - ExpEnv: Meta-World, Franka Kitchen, Adroit
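The usage pattern R3M enables is a frozen, pre-trained visual encoder feeding a small trainable policy head. In this sketch the `resnet18` backbone and the 7-DoF action head are hypothetical placeholders for the released R3M weights and a real robot setup:

```python
# Sketch: frozen pre-trained visual encoder (as in R3M) + trainable policy.
import torch
import torchvision.models as models

encoder = models.resnet18(weights=None)  # load R3M weights here in practice
encoder.fc = torch.nn.Identity()         # expose the 512-d features
for p in encoder.parameters():
    p.requires_grad = False              # the representation stays frozen

policy = torch.nn.Sequential(            # only this head is trained
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 7),             # e.g. 7-DoF arm actions
)

obs = torch.randn(8, 3, 224, 224)        # batch of RGB observations
with torch.no_grad():
    feats = encoder(obs)
action = policy(feats)
```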
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
  - Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, et al. NeurIPS2021
  - Key Words: Vision-and-Language Navigation
  - ExpEnv: Room-to-Room, Room-Across-Room
- Mastering Atari with Discrete World Models
  - Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, et al. ICLR2021
  - Key Words: World models, discrete latents (see the sketch below)
  - ExpEnv: Atari
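DreamerV2's world model replaces Gaussian latents with vectors of categorical variables trained with straight-through gradients. A minimal sketch of that sampling trick (the 32 groups x 32 classes follow the paper's configuration; the rest is illustrative):

```python
# Sketch of DreamerV2-style discrete (categorical) latents with
# straight-through gradients through the sampling step.
import torch
import torch.nn.functional as F

def sample_discrete_latent(logits):
    # logits: (B, groups, classes); one one-hot sample per group.
    probs = F.softmax(logits, dim=-1)
    idx = torch.distributions.Categorical(probs=probs).sample()
    one_hot = F.one_hot(idx, logits.shape[-1]).float()
    # Straight-through estimator: the forward pass uses the hard sample,
    # the backward pass flows gradients through the softmax probabilities.
    return one_hot + probs - probs.detach()

logits = torch.randn(16, 32, 32, requires_grad=True)  # 32 groups x 32 classes
z = sample_discrete_latent(logits)
z.sum().backward()  # gradients reach the logits despite the sampling
```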
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
  - Bowen Baker, Ilge Akkaya, Peter Zhokhov, et al. ArXiv2022
  - Key Words: Inverse Dynamics Model (see the sketch below)
  - ExpEnv: MineRL
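VPT's inverse dynamics model (IDM) predicts the action taken between neighboring frames, then pseudo-labels unlabeled videos so they can be used for imitation. A toy sketch, with hypothetical feature dimensions and a discrete action space:

```python
# Sketch of the inverse-dynamics-model (IDM) idea from VPT. Shapes and
# the MLP architecture are illustrative, not the actual VPT model.
import torch

class InverseDynamicsModel(torch.nn.Module):
    def __init__(self, feat_dim=512, n_actions=10):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2 * feat_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, n_actions),
        )

    def forward(self, feat_t, feat_tp1):
        # Concatenate features of frames t and t+1; predict action logits.
        return self.net(torch.cat([feat_t, feat_tp1], dim=-1))

idm = InverseDynamicsModel()
feat_t, feat_tp1 = torch.randn(4, 512), torch.randn(4, 512)
pseudo_actions = idm(feat_t, feat_tp1).argmax(dim=-1)
# `pseudo_actions` would then serve as labels for behavioral cloning.
```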
- Offline Reinforcement Learning from Images with Latent Space Models
  - Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, et al. L4DC2021
  - Key Words: Latent Space Models
  - ExpEnv: DeepMind Control, Adroit Pen, Sawyer Door Open, ROBEL D’Claw Screw
- Pretraining Representations for Data-Efficient Reinforcement Learning
  - Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, et al. NeurIPS2021
  - Key Words: latent dynamics modelling, unsupervised RL (see the sketch below)
  - ExpEnv: Atari
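The latent dynamics modelling objective here is in the SPR family: predict a target encoder's embedding of the next observation from the current latent and the action. A minimal sketch with stand-in linear modules (the real method uses CNN encoders and an EMA-updated target):

```python
# Sketch of a self-predictive latent dynamics objective. All modules
# and dimensions are illustrative stand-ins.
import torch
import torch.nn.functional as F

online = torch.nn.Linear(64, 32)          # online encoder (trained)
target = torch.nn.Linear(64, 32)          # EMA copy in the real method
target.load_state_dict(online.state_dict())
transition = torch.nn.Linear(32 + 4, 32)  # latent dynamics head

obs, next_obs = torch.randn(16, 64), torch.randn(16, 64)
action = torch.randn(16, 4)

pred = transition(torch.cat([online(obs), action], dim=-1))
with torch.no_grad():
    tgt = target(next_obs)                # target branch gets no gradient
# Negative cosine similarity between predicted and target latents.
loss = -F.cosine_similarity(pred, tgt, dim=-1).mean()
loss.backward()
```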
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
  - Dhruv Shah, Blazej Osinski, Brian Ichter, et al. ArXiv2022
  - Key Words: CLIP, ViNG, GPT-3
  - ExpEnv: None
- Learning Generalizable Robotic Reward Functions from “In-The-Wild” Human Videos
  - Annie S. Chen, Suraj Nair, Chelsea Finn. RSS2021
  - Key Words: Reward Functions, “In-The-Wild” Human Videos
  - ExpEnv: None
- Reinforcement Learning with Videos: Combining Offline Observations with Interaction
  - Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, et al. CoRL2020
  - Key Words: learning from videos
  - ExpEnv: robotic pushing task, Meta-World
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
  - Wenlong Huang, Pieter Abbeel, Deepak Pathak, et al. ICML2022
  - Key Words: large language models, Embodied Agents (see the sketch below)
  - ExpEnv: VirtualHome
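The paper's grounding trick maps each free-form step generated by the LLM to the closest admissible environment action via sentence-embedding similarity. A toy sketch in which `embed` is a random-vector placeholder for a real Sentence-BERT-style encoder:

```python
# Sketch of grounding LLM plan steps to admissible actions by
# embedding similarity. `embed` is a hypothetical placeholder.
import torch
import torch.nn.functional as F

def embed(texts):
    # Placeholder: random unit vectors. Swap in a real sentence encoder.
    return F.normalize(torch.randn(len(texts), 384), dim=-1)

admissible = ["walk to kitchen", "open fridge", "grab milk"]
llm_step = ["go to the kitchen"]  # raw text generated by the LLM

scores = embed(llm_step) @ embed(admissible).T  # (1, num_actions)
action = admissible[scores.argmax().item()]
print(action)  # the executable action closest in meaning to the step
```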
- Reinforcement Learning with Action-Free Pre-Training from Videos
  - Younggyo Seo, Kimin Lee, Stephen James, et al. ICML2022
  - Key Words: action-free pre-training, videos
  - ExpEnv: Meta-World, DeepMind Control Suite
- History Compression via Language Models in Reinforcement Learning
  - Fabian Paischer, Thomas Adler, Vihang Patil, et al. ICML2022
  - Key Words: history compression, frozen pretrained language models
  - ExpEnv: Minigrid, Procgen
- Learning Actionable Representations with Goal-Conditioned Policies
  - Dibya Ghosh, Abhishek Gupta, Sergey Levine. ICLR2019
  - Key Words: actionable representation learning
  - ExpEnv: 2D navigation (2D Wall, 2D Rooms, Wheeled, Wheeled Rooms, Ant, Pushing)
- Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?
  - Vandana Rajan, Alessio Brutti, Andrea Cavallaro. ICASSP2022
  - Key Words: Multi-Modal Emotion Recognition, Cross-Attention (see the sketch below)
  - ExpEnv: None
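The distinction the paper studies fits in a few lines: self-attention lets a modality attend to itself, while cross-attention lets one modality query another. A minimal sketch with hypothetical audio/text feature shapes:

```python
# Sketch: self-attention vs. cross-attention for two-modality fusion.
import torch

d = 256
audio = torch.randn(8, 50, d)  # (batch, audio frames, dim)
text = torch.randn(8, 20, d)   # (batch, text tokens, dim)

attn = torch.nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

# Self-attention: queries, keys, and values all come from one modality.
self_out, _ = attn(text, text, text)

# Cross-attention: text queries attend over audio keys/values, so each
# token gathers the audio evidence most relevant to it.
cross_out, _ = attn(query=text, key=audio, value=audio)
```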
- How Much Can CLIP Benefit Vision-and-Language Tasks?
  - Sheng Shen, Liunian Harold Li, Hao Tan, et al. ICLR2022
  - Key Words: Vision-and-Language, CLIP
  - ExpEnv: None
- Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning
  - Austin W. Hanjie, Victor Zhong, Karthik Narasimhan. ICML2021
  - Key Words: multi-modal attention
  - ExpEnv: Messenger
- Learning Latent Dynamics for Planning from Pixels
  - Danijar Hafner, Timothy Lillicrap, Ian Fischer, et al. ICML2019
  - Key Words: latent dynamics model, pixel observations (see the sketch below)
  - ExpEnv: DeepMind Control Suite
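PlaNet plans by running the cross-entropy method (CEM) over action sequences entirely inside the learned latent space. A minimal sketch in which `dynamics` and `reward` are toy stand-ins for the paper's recurrent state-space model (RSSM); the CEM hyperparameters follow the paper:

```python
# Sketch of CEM planning in a learned latent space, as in PlaNet.
import torch

def cem_plan(state, dynamics, reward, horizon=12, candidates=1000,
             elites=100, iters=10, action_dim=2):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        # Sample candidate action sequences; roll them out in latent space.
        actions = mean + std * torch.randn(candidates, horizon, action_dim)
        returns = torch.zeros(candidates)
        s = state.expand(candidates, -1)
        for t in range(horizon):
            s = dynamics(s, actions[:, t])
            returns += reward(s)
        # Refit the sampling distribution to the best sequences.
        elite = actions[returns.topk(elites).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0)
    return mean[0]  # execute the first action, then replan

# Toy stand-ins so the sketch runs end to end:
dynamics = lambda s, a: s + 0.1 * a.sum(dim=-1, keepdim=True)
reward = lambda s: -s.squeeze(-1).abs()
print(cem_plan(torch.zeros(1, 1), dynamics, reward))
```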
- Decoupling Representation Learning from Reinforcement Learning
  - Adam Stooke, Kimin Lee, Pieter Abbeel, et al. ICML2021
  - Key Words: representation learning, unsupervised learning (see the sketch below)
  - ExpEnv: DeepMind Control, Atari, DMLab
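The decoupled objective here (Augmented Temporal Contrast) trains the encoder with a contrastive loss over temporally close, augmented frames, using no reward signal at all. A minimal InfoNCE sketch with a stand-in linear encoder:

```python
# Sketch of an InfoNCE-style loss over temporally-close augmented
# observations, in the spirit of ATC. Encoder and shapes are illustrative.
import torch
import torch.nn.functional as F

def infonce_loss(anchor, positive, temperature=0.1):
    # anchor[i] should match positive[i]; other rows act as negatives.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature
    labels = torch.arange(anchor.shape[0])
    return F.cross_entropy(logits, labels)

encoder = torch.nn.Linear(3 * 84 * 84, 128)  # stand-in for a CNN encoder
obs_t = torch.randn(32, 3 * 84 * 84)         # augmented frame at time t
obs_tk = torch.randn(32, 3 * 84 * 84)        # augmented frame at time t+k
loss = infonce_loss(encoder(obs_t), encoder(obs_tk))
loss.backward()  # trains the encoder with no reward signal
```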
- Masked Visual Pre-training for Motor Control
  - Tete Xiao, Ilija Radosavovic, Trevor Darrell, et al. ArXiv2022
  - Key Words: self-supervised learning, motor control (see the sketch below)
  - ExpEnv: Isaac Gym
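MVP pre-trains its vision backbone with MAE-style masked reconstruction. The heart of that recipe is random patch masking, sketched below with an illustrative 75% mask ratio and a ViT-style patch grid:

```python
# Sketch of the random patch masking used in MAE-style pre-training.
# Mask ratio and patch grid are illustrative.
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (B, N, D) sequence of embedded image patches.
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    keep = noise.argsort(dim=1)[:, :n_keep]  # random subset per image
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    # The encoder sees only `visible`; a light decoder reconstructs the rest.
    return visible, keep

patches = torch.randn(8, 196, 768)  # 14x14 grid of ViT patch embeddings
visible, keep = random_masking(patches)
print(visible.shape)  # (8, 49, 768): only 25% of patches stay visible
```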
## Contributing
We want to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.
## License
Awesome Multi-Modal Reinforcement Learning is released under the Apache 2.0 license.