Reinforcement Learning (RL)

This repository focuses on Reinforcement Learning related concepts, use cases, points of view, and learning approaches. The material is based purely on my own learning, reading, and experience applying RL in practical, real-life contexts and scenarios.

Structure of Repository

Areas covered

  • Multi-Armed Bandit Problems (MABP)
  • Finite Markov Decision Processes (MDP)
  • Dynamic Programming Methods
  • Monte Carlo Methods
  • Temporal Difference (TD) Learning
  • Tabular Solution Methods and Approximate Solution Methods
  • Policy Gradient Methods
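
To give a flavour of the tabular and TD methods listed above, here is a minimal sketch of tabular Q-learning on a tiny, made-up 5-state chain environment. The environment, state/action sizes, and hyperparameters are illustrative assumptions for this README, not code taken from the repository's notebooks.

```python
import numpy as np

# Minimal sketch: tabular Q-learning (a TD method) on a made-up 5-state chain.
# Moving "right" from the last state yields reward 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2          # actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

def step(state, action):
    """Hypothetical chain dynamics: reward only at the right end."""
    if action == 1 and state == N_STATES - 1:
        return 0, 1.0, True                       # next state, reward, done
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return next_state, 0.0, False

state = 0
for _ in range(5000):
    # epsilon-greedy action selection (explore vs. exploit)
    if rng.random() < EPSILON:
        action = int(rng.integers(N_ACTIONS))
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward, done = step(state, action)
    # TD(0) update toward the bootstrapped target
    target = reward + (0.0 if done else GAMMA * np.max(Q[next_state]))
    Q[state, action] += ALPHA * (target - Q[state, action])
    state = 0 if done else next_state

print(np.round(Q, 2))   # learned action values for the toy chain
```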

How to Be Successful in Implementing RL

  • Figure out the adoption factors and secure the "right" stakeholder blessings upfront
  • Identify an "appropriate" business use case within the context of the industry / sub-industry / sub-segment: relevancy is a must
  • Estimate compute costs upfront and put together "short term" and "long term" ROI plans to track tasks and their benefits; watch for patterns in the outcomes so the strategy can be re-adjusted and tweaked along the way to stay effective and successful
  • Focus on the simulation method and look at how the strategy can serve multiple or related use cases, not just one or two

This is what separates the LEADERS from the LAGGARDS in this space!

Use Cases (Non-exhaustive, for understanding purposes)

| Use Case Theme | Description | Industry | Relevancy Category |
| --- | --- | --- | --- |
| Pricing and Promotion Analytics | Apply advanced pricing and promotion strategies to improve product margins | Agriculture | Next Best Actions for Customer |
| Waste and Cost Reduction | Optimize warehouse logistics and the network to reduce waste and maintenance costs | Agriculture | Optimize Complex Operations |
| Production Operations Management | Solve scheduling and production-allocation challenges to optimize and improve yield | Agriculture | Optimize Complex Operations |
| Optimization of Product Design Process | Optimize product design processes to shorten the development cycle for new vehicles and features and to improve quality | Automotive | Optimize Product Development Cycle / Design |
| Load Balancing | Balance the load on electricity grids under varying demand cycles | Energy and Utilities | Optimize Complex Operations |
| Yield Optimization | Enable real-time well monitoring and precision drilling for improved yield in oil operations | Energy and Utilities | Optimize Complex Operations |
| Trading Strategy Optimization | Optimize the trading strategy for an options-trading portfolio | Financial Services | Optimize Complex Operations |
| Customer Hyper-Personalization | Deliver advanced personalization that adapts promotions, next best offers and recommendations for increased customer satisfaction and sales | Financial Services | Next Best Actions for Customer |
| Clinical Trials | The well-being of patients during clinical trials is extremely important, alongside the actual results of the study. Here, exploration is equivalent to identifying the best treatment, and exploitation is treating patients as effectively as possible during the trial. | Life Sciences | Optimize Complex Operations |
| Effective Inventory Management with Robotics | Stock and pick inventory using robots | Retail and CPG | Optimize Product Development Cycle / Design |
| Network Routing | Routing is the process of selecting a path for traffic in a network, such as a telephone or computer network (the internet). Allocating channels to the right users so that overall throughput is maximised can be formulated as a MABP. | Generic / Common | Optimize Product Development Cycle / Design |
| Online Advertising | The goal of an advertising campaign is to maximise revenue from displaying ads; the advertiser earns revenue every time an offer is clicked by a web user. As in a MABP, there is a trade-off between exploration, where the goal is to collect information on an ad's performance via click-through rates, and exploitation, where we stick with the ad that has performed best so far. | Generic / Common | Next Best Actions for Customer |
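
The Online Advertising and Network Routing rows above frame the problem as a multi-armed bandit. As a hedged sketch of that framing, here is how ad selection with a UCB1 bandit might look; the click-through rates, number of ads, and the choice of UCB1 are illustrative assumptions, not something prescribed by this repository.

```python
import numpy as np

# Sketch: UCB1 bandit choosing among hypothetical ads with unknown click-through rates.
TRUE_CTR = np.array([0.02, 0.05, 0.03])   # made-up per-ad click probabilities
rng = np.random.default_rng(42)

n_ads = len(TRUE_CTR)
clicks = np.zeros(n_ads)                   # observed clicks per ad
shows = np.zeros(n_ads)                    # times each ad was shown

for t in range(1, 20001):
    if np.any(shows == 0):
        ad = int(np.argmin(shows))         # show every ad at least once
    else:
        # UCB1: empirical CTR plus an exploration bonus that shrinks with more data
        ucb = clicks / shows + np.sqrt(2.0 * np.log(t) / shows)
        ad = int(np.argmax(ucb))
    clicked = rng.random() < TRUE_CTR[ad]  # simulate the web user's response
    shows[ad] += 1
    clicks[ad] += clicked

print("estimated CTRs:", np.round(clicks / shows, 3))
print("impressions per ad:", shows.astype(int))
```

Over time the exploration bonus decays and most impressions concentrate on the best-performing ad, which is exactly the exploration/exploitation trade-off described in the table.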

Other References:

bsuite (Behaviour Suite for Reinforcement Learning) from the DeepMind team

"Staying Current" in RL

  • There are three key aspects pertinent to greater control over RL algorithms and their solving power:
    • The design approach for how rewards are maximized as the agent learns
    • The importance and relevancy of the learning environment
    • Compute power, which becomes significant once we turn to linear or non-linear function approximation (see the sketch after this list)
  • Soft Actor-Critic (SAC) algorithms are significantly increasing training efficiency and decreasing compute costs
  • Some of the key cloud computing work worth looking at:
    • Microsoft Project Bonsai
    • Google SEED RL
    • Amazon SageMaker RL
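
To make the function-approximation point above concrete, here is a hedged sketch of semi-gradient TD(0) with a linear value function on a made-up random-walk environment; the one-hot feature encoding, environment, and step sizes are illustrative assumptions rather than anything from this repository.

```python
import numpy as np

# Sketch: semi-gradient TD(0) with linear function approximation
# on a made-up 10-state random walk (terminate left = 0 reward, right = +1).
N_STATES = 10
ALPHA, GAMMA = 0.05, 1.0
rng = np.random.default_rng(7)

def features(state):
    """One-hot features; a real problem would use a richer encoding."""
    phi = np.zeros(N_STATES)
    phi[state] = 1.0
    return phi

w = np.zeros(N_STATES)                     # weights of the linear value function v(s) = w . phi(s)

for _ in range(2000):
    state = N_STATES // 2                  # start each episode in the middle
    while True:
        next_state = state + (1 if rng.random() < 0.5 else -1)
        if next_state < 0:                 # terminated on the left
            reward, done = 0.0, True
        elif next_state >= N_STATES:       # terminated on the right
            reward, done = 1.0, True
        else:
            reward, done = 0.0, False
        target = reward + (0.0 if done else GAMMA * (w @ features(next_state)))
        # semi-gradient update: move w toward the TD target along the feature vector
        w += ALPHA * (target - w @ features(state)) * features(state)
        if done:
            break
        state = next_state

print(np.round(w, 2))                      # approximate state values, increasing left to right
```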

Reference Materials

Resources

Using RL and multi-armed bandits to find the best classification model

FAQ