# Awesome Multi-Modal Reinforcement Learning
This is a collection of research papers on multi-modal reinforcement learning (MMRL). The repository will be continuously updated to track the frontier of MMRL. Some papers may not be strictly about RL, but we include them anyway because they may be useful for MMRL research.
Feel free to follow and star!
## Introduction
Multi-modal RL agents learn from video (images), language (text), or both, much as humans do. We believe it is important for intelligent agents to learn directly from images and text, since such data can be easily obtained from the Internet.
## Papers
format:
- [title](paper link) [links]
  - authors. venue
  - key words
  - experiment environment
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
  - Linxi Fan, Guanzhi Wang, Yunfan Jiang, et al. ArXiv2022
  - Key Words: multimodal dataset, MineCLIP (see the sketch below)
  - ExpEnv: Minecraft
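MineCLIP's core idea is to score how well a short video clip matches a natural-language task prompt and to use that score as a dense reward. A minimal sketch follows; the encoders, pooling, and dimensions here are illustrative stand-ins, not the released MineCLIP model:

```python
# Minimal sketch of a MineCLIP-style video-text similarity reward.
# The encoders below are hypothetical stand-ins, not the released model.
import torch
import torch.nn.functional as F

class VideoTextReward(torch.nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Stand-in projections; MineCLIP uses a frame-wise CLIP image
        # encoder with temporal aggregation and a text transformer.
        self.video_proj = torch.nn.Linear(768, dim)
        self.text_proj = torch.nn.Linear(768, dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, 768) per-frame features; mean-pool over time.
        v = F.normalize(self.video_proj(video_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        # Cosine similarity between clip and task prompt, used as reward.
        return (v * t).sum(dim=-1)

reward_fn = VideoTextReward()
video = torch.randn(1, 16, 768)  # features of 16 frames
text = torch.randn(1, 768)       # embedding of e.g. "shear a sheep"
print(reward_fn(video, text))    # one scalar reward per batch element
```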
- R3M: A Universal Visual Representation for Robot Manipulation
  - Suraj Nair, Aravind Rajeswaran, Vikash Kumar, et al. ArXiv2022
  - Key Words: Ego4D human video dataset, pre-trained visual representation (see the sketch below)
  - ExpEnv: Meta-World, Franka Kitchen, Adroit
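The usage pattern R3M enables is a frozen, pre-trained visual encoder feeding a small trainable policy head. In this sketch the `resnet18` backbone and the 7-DoF action head are hypothetical placeholders for the released R3M weights and a real robot setup:

```python
# Sketch: frozen pre-trained visual encoder (as in R3M) + trainable policy.
import torch
import torchvision.models as models

encoder = models.resnet18(weights=None)  # load R3M weights here in practice
encoder.fc = torch.nn.Identity()         # expose the 512-d features
for p in encoder.parameters():
    p.requires_grad = False              # the representation stays frozen

policy = torch.nn.Sequential(            # only this head is trained
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 7),             # e.g. 7-DoF arm actions
)

obs = torch.randn(8, 3, 224, 224)        # batch of RGB observations
with torch.no_grad():
    feats = encoder(obs)
action = policy(feats)
```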
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
  - Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, et al. NeurIPS2021
  - Key Words: Vision-and-Language Navigation
  - ExpEnv: Room-to-Room, Room-Across-Room
- Mastering Atari with Discrete World Models
  - Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, et al. ICLR2021
  - Key Words: World models, discrete latents (see the sketch below)
  - ExpEnv: Atari
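DreamerV2's world model replaces Gaussian latents with vectors of categorical variables trained with straight-through gradients. A minimal sketch of that sampling trick (the 32 groups x 32 classes follow the paper's configuration; the rest is illustrative):

```python
# Sketch of DreamerV2-style discrete (categorical) latents with
# straight-through gradients through the sampling step.
import torch
import torch.nn.functional as F

def sample_discrete_latent(logits):
    # logits: (B, groups, classes); one one-hot sample per group.
    probs = F.softmax(logits, dim=-1)
    idx = torch.distributions.Categorical(probs=probs).sample()
    one_hot = F.one_hot(idx, logits.shape[-1]).float()
    # Straight-through estimator: the forward pass uses the hard sample,
    # the backward pass flows gradients through the softmax probabilities.
    return one_hot + probs - probs.detach()

logits = torch.randn(16, 32, 32, requires_grad=True)  # 32 groups x 32 classes
z = sample_discrete_latent(logits)
z.sum().backward()  # gradients reach the logits despite the sampling
```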
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
  - Bowen Baker, Ilge Akkaya, Peter Zhokhov, et al. ArXiv2022
  - Key Words: Inverse Dynamics Model (see the sketch below)
  - ExpEnv: MineRL
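VPT's inverse dynamics model (IDM) predicts the action taken between neighboring frames, then pseudo-labels unlabeled videos so they can be used for imitation. A toy sketch, with hypothetical feature dimensions and a discrete action space:

```python
# Sketch of the inverse-dynamics-model (IDM) idea from VPT. Shapes and
# the MLP architecture are illustrative, not the actual VPT model.
import torch

class InverseDynamicsModel(torch.nn.Module):
    def __init__(self, feat_dim=512, n_actions=10):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2 * feat_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, n_actions),
        )

    def forward(self, feat_t, feat_tp1):
        # Concatenate features of frames t and t+1; predict action logits.
        return self.net(torch.cat([feat_t, feat_tp1], dim=-1))

idm = InverseDynamicsModel()
feat_t, feat_tp1 = torch.randn(4, 512), torch.randn(4, 512)
pseudo_actions = idm(feat_t, feat_tp1).argmax(dim=-1)
# `pseudo_actions` would then serve as labels for behavioral cloning.
```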
- Offline Reinforcement Learning from Images with Latent Space Models
  - Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, et al. L4DC2021
  - Key Words: Latent Space Models
  - ExpEnv: DeepMind Control, Adroit Pen, Sawyer Door Open, ROBEL D’Claw Screw
- Pretraining Representations for Data-Efficient Reinforcement Learning
  - Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, et al. NeurIPS2021
  - Key Words: latent dynamics modelling, unsupervised RL (see the sketch below)
  - ExpEnv: Atari
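The latent dynamics modelling objective here is in the SPR family: predict a target encoder's embedding of the next observation from the current latent and the action. A minimal sketch with stand-in linear modules (the real method uses CNN encoders and an EMA-updated target):

```python
# Sketch of a self-predictive latent dynamics objective. All modules
# and dimensions are illustrative stand-ins.
import torch
import torch.nn.functional as F

online = torch.nn.Linear(64, 32)          # online encoder (trained)
target = torch.nn.Linear(64, 32)          # EMA copy in the real method
target.load_state_dict(online.state_dict())
transition = torch.nn.Linear(32 + 4, 32)  # latent dynamics head

obs, next_obs = torch.randn(16, 64), torch.randn(16, 64)
action = torch.randn(16, 4)

pred = transition(torch.cat([online(obs), action], dim=-1))
with torch.no_grad():
    tgt = target(next_obs)                # target branch gets no gradient
# Negative cosine similarity between predicted and target latents.
loss = -F.cosine_similarity(pred, tgt, dim=-1).mean()
loss.backward()
```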
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
  - Dhruv Shah, Blazej Osinski, Brian Ichter, et al. ArXiv2022
  - Key Words: CLIP, ViNG, GPT-3
  - ExpEnv: None
- Learning Generalizable Robotic Reward Functions from “In-The-Wild” Human Videos
  - Annie S. Chen, Suraj Nair, Chelsea Finn. RSS2021
  - Key Words: Reward Functions, “In-The-Wild” Human Videos
  - ExpEnv: None
- Reinforcement Learning with Videos: Combining Offline Observations with Interaction
  - Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, et al. CoRL2020
  - Key Words: learning from videos
  - ExpEnv: robotic pushing task, Meta-World
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
  - Wenlong Huang, Pieter Abbeel, Deepak Pathak, et al. ICML2022
  - Key Words: large language models, Embodied Agents (see the sketch below)
  - ExpEnv: VirtualHome
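The paper's grounding trick maps each free-form step generated by the LLM to the closest admissible environment action via sentence-embedding similarity. A toy sketch in which `embed` is a random-vector placeholder for a real Sentence-BERT-style encoder:

```python
# Sketch of grounding LLM plan steps to admissible actions by
# embedding similarity. `embed` is a hypothetical placeholder.
import torch
import torch.nn.functional as F

def embed(texts):
    # Placeholder: random unit vectors. Swap in a real sentence encoder.
    return F.normalize(torch.randn(len(texts), 384), dim=-1)

admissible = ["walk to kitchen", "open fridge", "grab milk"]
llm_step = ["go to the kitchen"]  # raw text generated by the LLM

scores = embed(llm_step) @ embed(admissible).T  # (1, num_actions)
action = admissible[scores.argmax().item()]
print(action)  # the executable action closest in meaning to the step
```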
- Reinforcement Learning with Action-Free Pre-Training from Videos
  - Younggyo Seo, Kimin Lee, Stephen James, et al. ICML2022
  - Key Words: action-free pre-training, videos
  - ExpEnv: Meta-World, DeepMind Control Suite
- History Compression via Language Models in Reinforcement Learning
  - Fabian Paischer, Thomas Adler, Vihang Patil, et al. ICML2022
  - Key Words: history compression, frozen pretrained language models
  - ExpEnv: Minigrid, Procgen
- Learning Actionable Representations with Goal-Conditioned Policies
  - Dibya Ghosh, Abhishek Gupta, Sergey Levine. ICLR2019
  - Key Words: actionable representation learning
  - ExpEnv: 2D navigation (2D Wall, 2D Rooms, Wheeled, Wheeled Rooms, Ant, Pushing)
- Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?
  - Vandana Rajan, Alessio Brutti, Andrea Cavallaro. ICASSP2022
  - Key Words: Multi-Modal Emotion Recognition, Cross-Attention (see the sketch below)
  - ExpEnv: None
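The distinction the paper studies fits in a few lines: self-attention lets a modality attend to itself, while cross-attention lets one modality query another. A minimal sketch with hypothetical audio/text feature shapes:

```python
# Sketch: self-attention vs. cross-attention for two-modality fusion.
import torch

d = 256
audio = torch.randn(8, 50, d)  # (batch, audio frames, dim)
text = torch.randn(8, 20, d)   # (batch, text tokens, dim)

attn = torch.nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

# Self-attention: queries, keys, and values all come from one modality.
self_out, _ = attn(text, text, text)

# Cross-attention: text queries attend over audio keys/values, so each
# token gathers the audio evidence most relevant to it.
cross_out, _ = attn(query=text, key=audio, value=audio)
```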
- How Much Can CLIP Benefit Vision-and-Language Tasks?
  - Sheng Shen, Liunian Harold Li, Hao Tan, et al. ICLR2022
  - Key Words: Vision-and-Language, CLIP
  - ExpEnv: None
- Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning
  - Austin W. Hanjie, Victor Zhong, Karthik Narasimhan. ICML2021
  - Key Words: multi-modal attention
  - ExpEnv: Messenger
- Learning Latent Dynamics for Planning from Pixels
  - Danijar Hafner, Timothy Lillicrap, Ian Fischer, et al. ICML2019
  - Key Words: latent dynamics model, pixel observations (see the sketch below)
  - ExpEnv: DeepMind Control Suite
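PlaNet plans by running the cross-entropy method (CEM) over action sequences entirely inside the learned latent space. A minimal sketch in which `dynamics` and `reward` are toy stand-ins for the paper's recurrent state-space model (RSSM); the CEM hyperparameters follow the paper:

```python
# Sketch of CEM planning in a learned latent space, as in PlaNet.
import torch

def cem_plan(state, dynamics, reward, horizon=12, candidates=1000,
             elites=100, iters=10, action_dim=2):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        # Sample candidate action sequences; roll them out in latent space.
        actions = mean + std * torch.randn(candidates, horizon, action_dim)
        returns = torch.zeros(candidates)
        s = state.expand(candidates, -1)
        for t in range(horizon):
            s = dynamics(s, actions[:, t])
            returns += reward(s)
        # Refit the sampling distribution to the best sequences.
        elite = actions[returns.topk(elites).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0)
    return mean[0]  # execute the first action, then replan

# Toy stand-ins so the sketch runs end to end:
dynamics = lambda s, a: s + 0.1 * a.sum(dim=-1, keepdim=True)
reward = lambda s: -s.squeeze(-1).abs()
print(cem_plan(torch.zeros(1, 1), dynamics, reward))
```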
- Decoupling Representation Learning from Reinforcement Learning
  - Adam Stooke, Kimin Lee, Pieter Abbeel, et al. ICML2021
  - Key Words: representation learning, unsupervised learning (see the sketch below)
  - ExpEnv: DeepMind Control, Atari, DMLab
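The decoupled objective here (Augmented Temporal Contrast) trains the encoder with a contrastive loss over temporally close, augmented frames, using no reward signal at all. A minimal InfoNCE sketch with a stand-in linear encoder:

```python
# Sketch of an InfoNCE-style loss over temporally-close augmented
# observations, in the spirit of ATC. Encoder and shapes are illustrative.
import torch
import torch.nn.functional as F

def infonce_loss(anchor, positive, temperature=0.1):
    # anchor[i] should match positive[i]; other rows act as negatives.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature
    labels = torch.arange(anchor.shape[0])
    return F.cross_entropy(logits, labels)

encoder = torch.nn.Linear(3 * 84 * 84, 128)  # stand-in for a CNN encoder
obs_t = torch.randn(32, 3 * 84 * 84)         # augmented frame at time t
obs_tk = torch.randn(32, 3 * 84 * 84)        # augmented frame at time t+k
loss = infonce_loss(encoder(obs_t), encoder(obs_tk))
loss.backward()  # trains the encoder with no reward signal
```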
- Masked Visual Pre-training for Motor Control
  - Tete Xiao, Ilija Radosavovic, Trevor Darrell, et al. ArXiv2022
  - Key Words: self-supervised learning, motor control (see the sketch below)
  - ExpEnv: Isaac Gym
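MVP pre-trains its vision backbone with MAE-style masked reconstruction. The heart of that recipe is random patch masking, sketched below with an illustrative 75% mask ratio and a ViT-style patch grid:

```python
# Sketch of the random patch masking used in MAE-style pre-training.
# Mask ratio and patch grid are illustrative.
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (B, N, D) sequence of embedded image patches.
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    keep = noise.argsort(dim=1)[:, :n_keep]  # random subset per image
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    # The encoder sees only `visible`; a light decoder reconstructs the rest.
    return visible, keep

patches = torch.randn(8, 196, 768)  # 14x14 grid of ViT patch embeddings
visible, keep = random_masking(patches)
print(visible.shape)  # (8, 49, 768): only 25% of patches stay visible
```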
## Contributing
We want to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.
## License
Awesome Multi-Modal Reinforcement Learning is released under the Apache 2.0 license.