- The learning agent is not explicitly told which action to take
- The learning agent determines the best action to maximize long-term reward and executes it
- The selected action causes the current state of the environment to transition into its successor state
- The agent receives a scalar reward signal that evaluates the effect of this state transition
- The agent learns optimal or near-optimal action policies from such interactions in order to maximize its long-term objectives
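A minimal sketch of this interaction loop (the `env`/`agent` interface here is hypothetical, loosely following the common reset/step convention):

```python
# Minimal agent-environment interaction loop (hypothetical interface).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()
    for _ in range(max_steps):
        # The agent is never told the "right" action; it chooses one itself.
        action = agent.select_action(state)
        # The action triggers a state transition and a scalar reward
        # that evaluates the effect of that transition.
        next_state, reward, done = env.step(action)
        # The agent improves its policy from this interaction.
        agent.update(state, action, reward, next_state)
        state = next_state
        if done:
            break
```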
Despite many advances in RL theory and algorithms, one remaining challenge is scaling up to larger and more complex problems. The scaling problem for sequential decision-making mainly involves the following aspects:
- Large or continuous state or action spaces
- Hierarchically organized tasks and sub-tasks
- Solving several tasks with different rewards simultaneously
Multi-objective reinforcement learning (MORL) problem
A combination of multi-objective optimization (MOO) and RL techniques to solve the sequential decision-making problems with multiple conflicting objectives
- Obtain action policies that optimize two or more objectives at the same time
- Each objective has its own associated reward signal
- The reward is not a scalar value but a vector
- Combine the objectives if they are related
- Optimize the objectives separately if they are completely unrelated
- Make a trade-off among the conflicting objectives
Multi-objective to Single-objective Strategy
Pareto Strategy
Single-policy Approaches
Find the best single policy specified by a user or derived from the problem domain
Multiple-policy Approaches
Find a set of policies that approximate the Pareto front
A sequential decision-making problem can be formulated as an MDP
S: The state space, a finite set of states
A: The action space, a finite set of actions
R: The reward function
P: The state transition probability matrix
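In standard notation, the MDP is often written as the tuple below (reconstructed here as the usual definitions):

```latex
\mathcal{M} = \langle S, A, R, P \rangle, \qquad
P(s' \mid s, a) = \Pr\{ s_{t+1} = s' \mid s_t = s,\; a_t = a \}, \qquad
R : S \times A \times S \to \mathbb{R}
```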
Objective functions
- Discounted reward criteria
- Average reward criteria
Discounted Reward Criteria
Average Reward Criteria
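The two criteria are usually defined as follows (standard forms, with discount factor 0 ≤ γ < 1 and r_{t+1} the reward received at step t+1):

```latex
% Discounted reward criterion:
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right]

% Average reward criterion:
\rho^{\pi} = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{N-1} r_{t+1} \right]
```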
- RL algorithms integrate the techniques of Monte Carlo, stochastic approximation, and function approximation to obtain approximate solutions of MDPs
- As a central mechanism of RL, temporal-difference (TD) learning can be viewed as a combination of Monte Carlo and DP
- TD algorithms can learn the value functions from state transition data without model information (similar to Monte Carlo methods)
- TD methods can update the current estimate of the value functions partly based on previously learned results (similar to DP)
Discounted Reward Criteria Q-Learning Algorithm
Average Reward Criteria R-Learning Algorithm
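A minimal tabular sketch of both update rules, assuming numpy Q-tables indexed by (state, action); learning rates and variable names are mine:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Discounted criterion:
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def r_learning_update(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    # Average-reward criterion (rho estimates the average reward):
    # Q(s,a) <- Q(s,a) + alpha * [r - rho + max_a' Q(s',a') - Q(s,a)]
    was_greedy = Q[s, a] == np.max(Q[s])
    max_curr, max_next = np.max(Q[s]), np.max(Q[s_next])
    Q[s, a] += alpha * (r - rho + max_next - Q[s, a])
    if was_greedy:
        # rho is updated only on greedy steps:
        # rho <- rho + beta * [r - rho + max_a' Q(s',a') - max_a Q(s,a)]
        rho += beta * (r - rho + max_next - max_curr)
    return rho
```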
Maximize the Pareto optimality or a weighted scalar of all the elements while satisfying the constraint functions
Multi-objective to Single-objective Strategy
Solutions can be obtained by solving a single-objective optimization (SOO) problem
Pareto Dominance and Pareto Front
Find all the dominating solutions instead of the dominated ones. In practice, find a set of solutions that approximates the true Pareto front
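For a maximization problem with objective vector F(x) = (f_1(x), ..., f_m(x)), dominance and the Pareto front are defined as:

```latex
% x dominates y:
x \succ y \;\iff\; \bigl( \forall i:\; f_i(x) \ge f_i(y) \bigr) \;\wedge\; \bigl( \exists j:\; f_j(x) > f_j(y) \bigr)

% Pareto front: the set of objective vectors F(x) of all non-dominated x.
```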
Vector-valued state-action value function
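With m objectives, each carrying its own reward signal, the state-action value function becomes a vector:

```latex
\mathbf{Q}(s, a) = \bigl( Q_1(s, a),\; Q_2(s, a),\; \dots,\; Q_m(s, a) \bigr)^{\mathsf{T}}
```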
- MORL is a highly interdisciplinary field that integrates MOO methods and RL techniques to solve sequential decision-making problems with multiple conflicting objectives
- The related disciplines of MORL include artificial intelligence, decision and optimization theory, operations research, control theory, and so on
- Key challenges: suitably representing the designer's preferences or ensuring the optimization priority of certain policies in the Pareto front
- Designing efficient MORL algorithms
Single-policy Approaches
- Weighted sum approach
- W-learning
- Analytic hierarchy process (AHP) approach
- Ranking approach
- Geometric approach
Multiple-policy Approaches
- Convex hull approach
- Varying parameter approach
GM-Sarsa(0)
- Sum up the Q-values for all the objectives to estimate the combined Q-function
- The updates are based on the actually selected actions rather than the best action determined by the value function
- Has smaller errors between the estimated Q-values and the true Q-values
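A minimal sketch of the GM-Sarsa(0) idea, assuming one tabular Q per objective (names are mine):

```python
import numpy as np

def select_action(Qs, s):
    # Greedy merging: act on the sum of the per-objective Q-values.
    combined = sum(Q[s] for Q in Qs)
    return int(np.argmax(combined))

def gm_sarsa_update(Qs, s, a, rewards, s_next, a_next, alpha=0.1, gamma=0.95):
    # On-policy: bootstrap on the action actually selected (a_next),
    # not on each objective's own greedy action.
    for Q, r in zip(Qs, rewards):
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```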
Weighted sum approach
Top-Q method to compute W values
- Assign the W value as the highest Q-value among all the objectives in the current state
- Synthetic objective function for the Top-Q approach
- The objective with the highest Q-value may have similar priorities for different actions, while other objectives cannot be satisfied due to their low action values
- A change in reward scaling or the design of reward functions may greatly influence the results of the winner-take-all contest
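One way to write the Top-Q winner-take-all rule (my notation): the objective with the highest Q-value in the current state wins, and its greedy action is executed:

```latex
i^{*} = \arg\max_{i}\, \max_{a}\, Q_i(s, a), \qquad
a^{*} = \arg\max_{a}\, Q_{i^{*}}(s, a)
```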
W-Learning
- All the W values, except the highest W value, are updated after the action of each step is selected and executed
Negotiated W-Learning
- Explicitly find the objective that would lose the most long-term reward if it is not preferred in determining the next action (sketched below)
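A loose sketch of the negotiation step, assuming per-objective tabular Q-values (names are mine; the exact protocol varies):

```python
import numpy as np

def negotiated_action(Qs, s):
    # Leadership passes to the objective that would lose the most
    # long-term reward under the current leader's proposed action.
    leader = 0
    action = int(np.argmax(Qs[leader][s]))
    for _ in range(len(Qs) - 1):
        # W_i: objective i's loss if `action` replaces its own greedy action.
        W = [float(np.max(Q[s]) - Q[s, action]) for Q in Qs]
        challenger = int(np.argmax(W))
        if challenger == leader or W[challenger] <= 0:
            break
        leader = challenger
        action = int(np.argmax(Qs[leader][s]))
    return action
```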
- The designer of MORL algorithms may not have enough prior knowledge about the optimization problem
- The degree of relative importance between two objectives can be quantified by L grades, and a scalar value is defined for each grade
- Requires a lot of prior knowledge of the problem domain
Relative importance matrix
Importance factor
Value of improvement
Fuzzy inference system: compute the goodness of action a_p relative to action a_q
- Also called the sequential approach or the threshold approach
- Ensure the effectiveness of the subordinate objective
- Threshold values are specified for some objectives in order to place constraints on those objectives
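One common formalization is a thresholded lexicographic ordering (my notation, with thresholds τ_i on the constrained objectives):

```latex
C_i(s, a) = \min\bigl( Q_i(s, a),\; \tau_i \bigr), \quad i = 1, \dots, m - 1

% Actions are then compared by the vector (C_1, ..., C_{m-1}, Q_m)
% in lexicographic order.
```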
- Deal with dynamic unknown Markovian environments with long-term average reward vectors
- Assume that actions of other agents may influence the dynamics of the environment and the game is irreducible or ergodic
Multiple directions RL (MDRL) and single direction RL (SDRL)
Approximate a desired target set in a multidimensional objective space
- Simultaneously learn optimal policies for all linear preference assignments in the objective space
- Can find the optimal policy for any linear preference function
- Since multiple policies are learned at once, the integrated RL algorithms should be off-policy algorithms
Definition 1: Translation and scaling operations
Definition 2: Summing two convex hulls
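In the notation commonly used for convex-hull value iteration, the two operations are (b a vector, u a scalar, Q a set of value vectors):

```latex
% Definition 1 (translation and scaling):
\mathbf{b} + u\,\mathcal{Q} \;\equiv\; \{\, \mathbf{b} + u\,\mathbf{q} \mid \mathbf{q} \in \mathcal{Q} \,\}

% Definition 2 (summing two convex hulls):
\mathcal{Q}_1 + \mathcal{Q}_2 \;\equiv\; \mathrm{hull}\,\{\, \mathbf{q}_1 + \mathbf{q}_2 \mid \mathbf{q}_1 \in \mathcal{Q}_1,\; \mathbf{q}_2 \in \mathcal{Q}_2 \,\}
```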
- A multiple-policy approach can be realized by performing multiple runs with different parameters, objective thresholds, and orderings in any single-policy algorithm
Policy gradient methods and the idea of varying parameters
- Estimate multiple policy gradients for each objective
- Vary the weights of the objective gradients to find multiple policies
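A minimal sketch of the weighted-gradient idea (assuming per-objective policy-gradient estimates are available; names are mine):

```python
import numpy as np

def weighted_gradient_step(theta, objective_grads, weights, lr=0.01):
    # objective_grads[i]: estimated policy gradient of objective i's return.
    # Sweeping `weights` over the simplex traces out multiple policies.
    direction = sum(w * g for w, g in zip(weights, objective_grads))
    return theta + lr * direction
```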
Multi-objective fitted Q-iteration (FQI)
Find control policies for all the linear combinations of preferences assigned to the objectives in a single training procedure
To obtain suitable representations of the preferences and improve the efficiency of MORL algorithms
- Estimation of distribution algorithms (EDA)
- Incorporate the notions in evolutionary MOO
- Acquire various strategies by a single run
- Learning classifier system
- The choice of action-selection policies can greatly affect the performance of the learning system
- α-domination strategy
- Use a goal-directed bias based on the achievement level of each evaluation
- Parallel genetic algorithm (PGA)
- Evolve a neuro-controller
- Simultaneous perturbation stochastic approximation (SPSA) was used to improve the convergence
- Adaptive margins
To obtain Pareto optimal policies in large or continuous spaces
- Multi-agent framework
- e.g., traffic signal control
- Fitted Q-iteration (FQI)
- Approximate the Pareto front
- Consistency multi-objective dynamic programming
The small scale of previous MORL test problems may not verify an algorithm's performance in dealing with a wide range of different problem settings, and algorithm implementations usually require substantial prior knowledge about the problem domain