Real World Reinforcement Learning in Deployment

The repository compiles a list of real-world applications of reinforcement learning.

Only include methods that were deployed, are currently deployed, or will be deployed in the future.
Exclude RL applications to games and robotics where experiments were only done in simulation.
Only include publicly available information.

The repository also aggregates information from several sources, including

We categorize RL applications by deployment status (e.g., currently deployed, deployed at least once/for some time, planned to be deployed, or unknown) and by the approach used to solve the problem (e.g., online, offline, train with simulators, search with simulators, using offline data to build partial simulators).
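As an illustration only, the categories above could be captured in a small data structure. The class and field names below (`DeploymentStatus`, `Approach`, `Entry`) are hypothetical and are not part of this repository; the labels are copied from the list itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeploymentStatus(Enum):
    CURRENTLY_DEPLOYED = "Currently deployed"
    DEPLOYED_AT_LEAST_ONCE = "Deployed at least once/for some time"
    PLANNED = "Planned to be deployed"
    UNKNOWN = "Unknown"


class Approach(Enum):
    ONLINE = "Online"
    OFFLINE = "Offline"
    TRAIN_WITH_SIMULATORS = "Train with simulators"
    SEARCH_WITH_SIMULATORS = "Search with simulators"
    PARTIAL_SIMULATORS = "Using offline data to build partial simulators"


@dataclass
class Entry:
    """One item of the list (hypothetical schema, for illustration)."""
    summary: str
    status: DeploymentStatus = DeploymentStatus.UNKNOWN
    approach: Optional[Approach] = None
    algorithm: Optional[str] = None

    def render(self) -> str:
        # Emit only the fields that are known, in the order used by the list.
        lines = [self.summary, f"Deployment status: {self.status.value}"]
        if self.approach is not None:
            lines.append(f"Approach: {self.approach.value}")
        if self.algorithm is not None:
            lines.append(f"Algorithm: {self.algorithm}")
        return "\n".join(lines)


entry = Entry(
    summary="Telus used RL to reduce energy consumption for data centers.",
    status=DeploymentStatus.UNKNOWN,
    approach=Approach.TRAIN_WITH_SIMULATORS,
)
print(entry.render())
```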

Industrial Control
Energy Control
Control of Physical Systems
Large Language Models & Conversational Systems
Other Applications without Deployment
Real World Gym
Open Source Software
Other Resources

Industrial Control

AMII has been applying RL to water treatment.
Link: Blog Post
Deployment status: Planned to be deployed

Google DeepMind used RL to improve the energy efficiency of heating, ventilation, and air conditioning (HVAC) control.
Link: Paper2022, NeurIPS2018
Deployment status: Deployed at least once/for some time. Experiments were done in real-world facilities.
Approach: Online
Algorithm: Policy iteration, with value function estimated from offline data

Telus used RL to reduce energy consumption for data centers.
Link: Presentation, Announcement
Deployment status: Unknown
Approach: Train with simulators

Foobot used deep RL for HVAC optimization.
Link: Blog
Deployment status: Currently deployed (based on the information here)
Approach: Train with simulators
Difficulty: High dimensional action spaces
Algorithm: PPO with autoregressive policies
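To illustrate why an autoregressive policy helps with high-dimensional action spaces, here is a minimal sketch, not Foobot's implementation. Instead of one distribution over every joint action, the policy samples one action dimension at a time, conditioned on the dimensions already chosen, and the joint log-probability (the quantity PPO needs) is the sum of the conditional log-probabilities. The function `conditional_probs` is a hypothetical stand-in for a learned network, and the state and sizes are made up.

```python
import itertools
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_probs(state, previous_actions, num_choices):
    """Hypothetical stand-in for a learned network: distribution over the next
    action dimension given the state and the dimensions chosen so far."""
    base = sum(state) + sum((i + 1) * a for i, a in enumerate(previous_actions))
    return softmax([math.sin(base + k) for k in range(num_choices)])


def sample_action(state, num_dims, num_choices, rng):
    """Sample one dimension at a time, accumulating the joint log-probability."""
    action, log_prob = [], 0.0
    for _ in range(num_dims):
        probs = conditional_probs(state, action, num_choices)
        choice = rng.choices(range(num_choices), weights=probs)[0]
        log_prob += math.log(probs[choice])
        action.append(choice)
    return action, log_prob


def joint_log_prob(state, action, num_choices):
    """Log-probability of a full action: sum of conditional log-probabilities."""
    return sum(
        math.log(conditional_probs(state, action[:i], num_choices)[a])
        for i, a in enumerate(action)
    )


state = [0.2, -0.1]
rng = random.Random(0)
action, log_prob = sample_action(state, num_dims=3, num_choices=4, rng=rng)

# Sanity check: the factorized probabilities sum to 1 over all 4**3 joint actions.
total_mass = sum(
    math.exp(joint_log_prob(state, list(a), 4))
    for a in itertools.product(range(4), repeat=3)
)
```

With 3 dimensions of 4 choices each, the network only ever outputs 4 values per step rather than 64 values for the joint action; the gap grows exponentially with the number of dimensions.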

NVIDIA used RL for data center congestion control.
Link: Paper2023
Deployment status: Experiments were done in a real-world system.
Approach: Train with simulators
Difficulty: Constraints on low memory and low inference time, multi-agent POMDP
Algorithm: Policy gradient with LSTM layers -> distill to lightweight decision trees

Siemens Technology has been working on industrial applications of RL.
Link: Video
Deployment status: Unknown

Phaidra has been working on using deep RL to improve plant stability and energy efficiency.
Link: Website, Technical Report
Deployment status: Unknown

Microsoft Project Bonsai used RL for industrial control systems.
Link: Website, Report
Deployment status: Unknown
Approach: Train with simulators

Energy Control

DeepMind successfully controlled nuclear fusion plasma in a tokamak with deep reinforcement learning.
Link: Nature2022, Post
Deployment status: Real-world experiments on TCV (an experimental tokamak)
Approach: Train with simulators
Algorithm: MPO (four-layer neural network for the actor, larger RNNs for the critic)

DeepThermal uses model-based offline RL to optimize the combustion efficiency of a thermal power generating unit.
Link: AAAI2022
Deployment status: Currently deployed (deployed in four large coal-fired thermal power plants in China)
Approach: Offline
Algorithm: offline model learning using LSTM + offline actor-critic with reward penalty

Control of Physical Systems

Google and Loon used RL to control a superpressure balloon in the stratosphere.
Link: Nature2020
Deployment status: Currently deployed
Approach: Using offline data to build partial simulators (wind simulation based on historical data)
Difficulty: Partial observability
Algorithm: Incorporate uncertainty estimates as additional inputs, QR-DQN with a seven-layer ReLU network + parallel simulation
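For readers unfamiliar with QR-DQN, the sketch below shows the quantile Huber loss that the algorithm is generally described as minimizing. It is a generic plain-Python illustration, not code from the Loon system; the function names, the threshold `kappa=1.0`, and the toy numbers are assumptions. Each of the N predicted quantiles is paired with the quantile midpoint tau_i = (i + 0.5) / N, and errors are weighted asymmetrically so that each output converges to its own quantile of the return distribution.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic near zero, linear beyond the threshold kappa."""
    abs_u = abs(u)
    if abs_u <= kappa:
        return 0.5 * u * u
    return kappa * (abs_u - 0.5 * kappa)


def quantile_huber_loss(quantiles, targets, kappa=1.0):
    """Sum over predicted quantiles of the mean asymmetric Huber loss
    against the target samples."""
    n = len(quantiles)
    total = 0.0
    for i, theta in enumerate(quantiles):
        tau = (i + 0.5) / n  # midpoint of the i-th quantile interval
        for z in targets:
            u = z - theta  # positive when the prediction is too low
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            total += weight * huber(u, kappa)
    return total / len(targets)


# Toy example: two predicted quantiles, both below a single target sample.
loss = quantile_huber_loss(quantiles=[0.0, 0.0], targets=[1.0])
```

In the toy example both predictions underestimate the target, and the upper quantile (tau = 0.75) is penalized three times as heavily as the lower one (tau = 0.25).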

Swift achieved champion-level performance in drone racing.
Link: Nature2023
Deployment status: Deployed at least once/for some time (won several races against human champions)
Approach: Train with simulators + fine-tune by collecting more real-world data
Difficulty: Optimizing a policy purely in simulation yields poor performance on physical hardware
Algorithm: PPO + parallel simulation

Large Language Models & Conversational Systems

OpenAI used Reinforcement Learning from Human Feedback (RLHF) for ChatGPT.
Link: Introducing ChatGPT, NeurIPS 2020
Deployment status: Currently deployed
Algorithm: PPO with learned reward models, penalizes the KL divergence between the RL policy and the original supervised model
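The KL penalty mentioned above can be sketched as follows. This is a minimal illustration, assuming the penalty is applied to the summed per-token log-probability ratio of a sampled response; the function name, the argument names, and the coefficient `beta=0.1` are hypothetical and not taken from OpenAI's implementation.

```python
def kl_penalized_reward(reward_model_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times the log-probability ratio between
    the RL policy and the supervised reference model on one sampled response."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same tokens")
    # Summed over tokens, this is a single-sample estimate of KL(policy || reference).
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * log_ratio


# Toy numbers: the policy assigns each token a higher log-probability than the
# reference model does, so the shaped reward drops below the raw score.
shaped = kl_penalized_reward(
    reward_model_score=1.0,
    policy_logprobs=[-1.0, -2.0],
    reference_logprobs=[-1.5, -2.5],
    beta=0.1,
)
```

The penalty discourages the policy from drifting far from the supervised model just to exploit weaknesses in the learned reward model.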

DeepMind used RLHF for Sparrow.
Link: Sparrow, Blog
Deployment status: Unknown (the model was not released publicly)
Algorithm: A2C with learned reward models, penalizes the KL divergence between the fine-tuned policy and the initial teacher language model

Anthropic used Reinforcement Learning from AI Feedback for Claude.
Link: Constitutional AI, Iterated Online RLHF
Deployment status: Currently deployed
Algorithm: Preference labelling is done by an independent model (the feedback model) instead of humans. The remainder of the training pipeline is exactly the same as RLHF with PPO.

Meta used RLHF for Llama 2.
Link: Llama 2
Deployment status: Currently deployed
Algorithm: PPO with rejection sampling fine-tuning
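The selection step of rejection sampling can be sketched as a best-of-K procedure: sample K responses, score each with the reward model, and keep the highest-scoring one as a fine-tuning target. The code below is a toy illustration under that assumption, not Meta's pipeline; `toy_generate` and `toy_reward` are hypothetical stand-ins for the policy and the learned reward model.

```python
import random


def toy_generate(prompt, rng):
    """Hypothetical stand-in for sampling a response from the current policy."""
    return prompt + " " + rng.choice(["ok", "fine", "a longer, more detailed answer"])


def toy_reward(prompt, response):
    """Hypothetical stand-in for a learned reward model (longer scores higher)."""
    return float(len(response))


def best_of_k(prompt, generate, reward, k, rng):
    """Sample k candidates and return the one with the highest reward."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [generate(prompt, rng) for _ in range(k)]
    best = max(candidates, key=lambda c: reward(prompt, c))
    return best, candidates


rng = random.Random(0)
best, candidates = best_of_k("Q:", toy_generate, toy_reward, k=4, rng=rng)
```

The selected responses would then be used as supervised fine-tuning targets; that training step is omitted here.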

Google developed a real-time and open-ended dialogue system using RL.
Link: Paper
Deployment status: Currently deployed in Google Assistant
Approach: Offline
Algorithm: Stochastic Action Q-learning & Continuous Action Q-learning & Conservative Q-learning

Recommendation

Yahoo (online bandits)

Azure AI Personalizer

Operations Research

Amazon inventory control
Link: Paper

Google Maps

Ridesharing

Finance

Accounting

IRS uses bandits for audit selection
Link: Paper

Chip Design

Compiler Optimization

Compiler Optimization
Memory mapping

Drug Discovery

Education

Healthcare

Machine Learning for Mechanical Ventilation Control
Link: https://arxiv.org/pdf/2111.10434.pdf

Sport

Emirates Team New Zealand won the America’s Cup with the help of an RL agent.
Link: Presentation

Algorithm

Matrix multiplication

Video compression

Other Applications without Deployment

Apple used RL to learn a network defense policy.
Link: Paper

Hewlett Packard Enterprise used RL to control Wave Energy Converters.

Boeing used RL to optimize an obstacle avoidance policy.
Link: Paper

Real World Gym

SustainGym: Reinforcement Learning Environments for Sustainable Energy Systems
CybORG: A Gym for the Development of Autonomous Cyber Agents
DCRL-Green: Sustainable Data Center Environment and Benchmark for Multi-Agent Reinforcement Learning

Open Source Software

Pearl - A Production-ready Reinforcement Learning AI Agent Library from Meta
RLlib: Industry-Grade Reinforcement Learning
FinRL: Financial Reinforcement Learning
TRL: Transformer Reinforcement Learning from Hugging Face
RL4LMs: A modular RL library to fine-tune language models to human preferences from AI2

Other Resources

Blog Posts

Towards Deployable RL - What’s Broken with RL Research and a Potential Fix by Shie Mannor and Aviv Tamar
Don’t Panic! Reinforcement learning is full of magical things patiently waiting for our wits to grow sharper by Marlos C. Machado

Lectures

CMU Real World RL course by Emma Brunskill

Journal & Workshop

MLJ Special Issue on Reinforcement Learning for Real Life
Reinforcement Learning for Real Life Workshop @ NeurIPS and ICML

VincentLiu3/real-world-RL-deployment