- The learning agent is not explicitly told which action to take
- The learning agent determines the best action to maximize long-term reward and executes it
- The selected action causes the current state of the environment to transition into its successor state
- The agent receives a scalar reward signal that evaluates the effect of this state transition
- The agent learns optimal or near-optimal action policies from such interactions in order to maximize its long-term objectives
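A minimal sketch of this interaction loop (the `env`/`agent` interface here is hypothetical, loosely following the common reset/step convention):

```python
# Minimal agent-environment interaction loop (hypothetical interface).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()
    for _ in range(max_steps):
        # The agent is never told the "right" action; it chooses one itself.
        action = agent.select_action(state)
        # The action triggers a state transition and a scalar reward
        # that evaluates the effect of that transition.
        next_state, reward, done = env.step(action)
        # The agent improves its policy from this interaction.
        agent.update(state, action, reward, next_state)
        state = next_state
        if done:
            break
```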
Despite many advances in RL theory and algorithms, one remaining challenge is scaling up to larger and more complex problems. The scaling problem for sequential decision-making mainly involves the following aspects:
- Large or continuous state or action spaces
- Hierarchically organized tasks and sub-tasks
- Solving several tasks with different rewards simultaneously
Multi-objective reinforcement learning (MORL) problem
A combination of multi-objective optimization (MOO) and RL techniques to solve the sequential decision-making problems with multiple conflicting objectives
- Obtain action policies that optimize two or more objectives at the same time
- Each objective has its own associated reward signal
- The reward is not a scalar value but a vector
- Combine the objectives if they are related
- Optimize the objectives separately if they are completely unrelated
- Make a trade-off among the conflicting objectives
Multi-objective to Single-objective Strategy
Pareto Strategy
Single-policy Approaches
Find the best single policy specified by a user or derived from the problem domain
Multiple-policy Approaches
Find a set of policies that approximate the Pareto front
A sequential decision-making problem can be formulated as an MDP
S: The state space, a finite set of states
A: The action space, a finite set of actions
R: The reward function
P: The state transition probability matrix
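In standard notation, the MDP is often written as the tuple below (reconstructed here as the usual definitions):

```latex
\mathcal{M} = \langle S, A, R, P \rangle, \qquad
P(s' \mid s, a) = \Pr\{ s_{t+1} = s' \mid s_t = s,\; a_t = a \}, \qquad
R : S \times A \times S \to \mathbb{R}
```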
Objective functions
- Discounted reward criteria
- Average reward criteria
Discounted Reward Criteria
Average Reward Criteria
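The two criteria are usually defined as follows (standard forms, with discount factor 0 ≤ γ < 1 and r_{t+1} the reward received at step t+1):

```latex
% Discounted reward criterion:
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right]

% Average reward criterion:
\rho^{\pi} = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{N-1} r_{t+1} \right]
```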
- RL algorithms integrate the techniques of Monte Carlo, stochastic approximation, and function approximation to obtain approximate solutions of MDPs
- As a central mechanism of RL, temporal-difference (TD) learning can be viewed as a combination of Monte Carlo and DP
- TD algorithms can learn the value functions from state transition data without model information (similar to Monte Carlo methods)
- TD methods can update the current estimate of the value functions partly based on previously learned results (similar to DP)
Discounted Reward Criteria Q-Learning Algorithm
Average Reward Criteria R-Learning Algorithm
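A minimal tabular sketch of both update rules, assuming numpy Q-tables indexed by (state, action); learning rates and variable names are mine:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Discounted criterion:
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def r_learning_update(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    # Average-reward criterion (rho estimates the average reward):
    # Q(s,a) <- Q(s,a) + alpha * [r - rho + max_a' Q(s',a') - Q(s,a)]
    was_greedy = Q[s, a] == np.max(Q[s])
    max_curr, max_next = np.max(Q[s]), np.max(Q[s_next])
    Q[s, a] += alpha * (r - rho + max_next - Q[s, a])
    if was_greedy:
        # rho is updated only on greedy steps:
        # rho <- rho + beta * [r - rho + max_a' Q(s',a') - max_a Q(s,a)]
        rho += beta * (r - rho + max_next - max_curr)
    return rho
```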
Maximize the Pareto optimality or a weighted scalar of all the elements while satisfying the constraint functions
Multi-objective to Single-objective Strategy
Solutions can be obtained by solving a single-objective optimization (SOO) problem
Pareto Dominance and Pareto Front
Find all the dominating solutions instead of the dominated ones. In practice, find a set of solutions that approximates the true Pareto front
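For a maximization problem with objective vector F(x) = (f_1(x), ..., f_m(x)), dominance and the Pareto front are defined as:

```latex
% x dominates y:
x \succ y \;\iff\; \bigl( \forall i:\; f_i(x) \ge f_i(y) \bigr) \;\wedge\; \bigl( \exists j:\; f_j(x) > f_j(y) \bigr)

% Pareto front: the set of objective vectors F(x) of all non-dominated x.
```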
Vector-valued state-action value function
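With m objectives, each carrying its own reward signal, the state-action value function becomes a vector:

```latex
\mathbf{Q}(s, a) = \bigl( Q_1(s, a),\; Q_2(s, a),\; \dots,\; Q_m(s, a) \bigr)^{\mathsf{T}}
```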
- MORL is a highly interdisciplinary field that integrates MOO methods and RL techniques to solve sequential decision-making problems with multiple conflicting objectives
- The related disciplines of MORL include artificial intelligence, decision and optimization theory, operations research, control theory, and so on
- Key challenges: suitably representing the designer's preferences or ensuring the optimization priority of certain policies in the Pareto front
- Designing efficient MORL algorithms
Single-policy Approaches
- Weighted sum approach
- W-learning
- Analytic hierarchy process (AHP) approach
- Ranking approach
- Geometric approach
Multiple-policy Approaches
- Convex hull approach
- Varying parameter approach
GM-Sarsa(0)
- Sum up the Q-values for all the objectives to estimate the combined Q-function
- The updates are based on the actually selected actions rather than the best action determined by the value function
- Has smaller errors between the estimated Q-values and the true Q-values
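A minimal sketch of the GM-Sarsa(0) idea, assuming one tabular Q per objective (names are mine):

```python
import numpy as np

def select_action(Qs, s):
    # Greedy merging: act on the sum of the per-objective Q-values.
    combined = sum(Q[s] for Q in Qs)
    return int(np.argmax(combined))

def gm_sarsa_update(Qs, s, a, rewards, s_next, a_next, alpha=0.1, gamma=0.95):
    # On-policy: bootstrap on the action actually selected (a_next),
    # not on each objective's own greedy action.
    for Q, r in zip(Qs, rewards):
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```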
Weighted sum approach
Top-Q method to compute W values
- Assign the W value as the highest Q-value among all the objectives in the current state
- Synthetic objective function for the Top-Q approach
- The objective with the highest Q-value may have similar priorities for different actions, while other objectives cannot be satisfied due to their low action values
- A change in reward scaling or the design of reward functions may greatly influence the results of the winner-take-all contest
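One way to write the Top-Q winner-take-all rule (my notation): the objective with the highest Q-value in the current state wins, and its greedy action is executed:

```latex
i^{*} = \arg\max_{i}\, \max_{a}\, Q_i(s, a), \qquad
a^{*} = \arg\max_{a}\, Q_{i^{*}}(s, a)
```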
W-Learning
- All the W values, except the highest W value, are updated after the action of each step is selected and executed
Negotiated W-Learning
- Explicitly find the objective that would lose the most long-term reward if it is not preferred in determining the next action (sketched below)
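A loose sketch of the negotiation step, assuming per-objective tabular Q-values (names are mine; the exact protocol varies):

```python
import numpy as np

def negotiated_action(Qs, s):
    # Leadership passes to the objective that would lose the most
    # long-term reward under the current leader's proposed action.
    leader = 0
    action = int(np.argmax(Qs[leader][s]))
    for _ in range(len(Qs) - 1):
        # W_i: objective i's loss if `action` replaces its own greedy action.
        W = [float(np.max(Q[s]) - Q[s, action]) for Q in Qs]
        challenger = int(np.argmax(W))
        if challenger == leader or W[challenger] <= 0:
            break
        leader = challenger
        action = int(np.argmax(Qs[leader][s]))
    return action
```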
- The designer of MORL algorithms may not have enough prior knowledge about the optimization problem
- The degree of relative importance between two objectives can be quantified by L grades, and a scalar value is defined for each grade
- Requires a lot of prior knowledge of the problem domain
Relative importance matrix
Importance factor
Value of improvement
Fuzzy inference system: compute the goodness of action a_p relative to action a_q
- Also called the sequential approach or the threshold approach
- Ensure the effectiveness of the subordinate objective
- Threshold values are specified for some objectives in order to place constraints on those objectives
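One common formalization is a thresholded lexicographic ordering (my notation, with thresholds τ_i on the constrained objectives):

```latex
C_i(s, a) = \min\bigl( Q_i(s, a),\; \tau_i \bigr), \quad i = 1, \dots, m - 1

% Actions are then compared by the vector (C_1, ..., C_{m-1}, Q_m)
% in lexicographic order.
```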
- Deal with dynamic unknown Markovian environments with long-term average reward vectors
- Assume that actions of other agents may influence the dynamics of the environment and the game is irreducible or ergodic
Multiple directions RL (MDRL) and single direction RL (SDRL)
Approximate a desired target set in a multidimensional objective space
- Simultaneously learn optimal policies for all linear preference assignments in the objective space
- Can find the optimal policy for any linear preference function
- Since multiple policies are learned at once, the integrated RL algorithms should be off-policy algorithms
Definition 1: Translation and scaling operations
Definition 2: Summing two convex hulls
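In the notation commonly used for convex-hull value iteration, the two operations are (b a vector, u a scalar, Q a set of value vectors):

```latex
% Definition 1 (translation and scaling):
\mathbf{b} + u\,\mathcal{Q} \;\equiv\; \{\, \mathbf{b} + u\,\mathbf{q} \mid \mathbf{q} \in \mathcal{Q} \,\}

% Definition 2 (summing two convex hulls):
\mathcal{Q}_1 + \mathcal{Q}_2 \;\equiv\; \mathrm{hull}\,\{\, \mathbf{q}_1 + \mathbf{q}_2 \mid \mathbf{q}_1 \in \mathcal{Q}_1,\; \mathbf{q}_2 \in \mathcal{Q}_2 \,\}
```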
- A multiple-policy approach can be realized by performing multiple runs with different parameters, objective thresholds, and orderings in any single-policy algorithm
Policy gradient methods and the idea of varying parameters
- Estimate multiple policy gradients for each objective
- Vary the weights of the objective gradients to find multiple policies
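A minimal sketch of the weighted-gradient idea (assuming per-objective policy-gradient estimates are available; names are mine):

```python
import numpy as np

def weighted_gradient_step(theta, objective_grads, weights, lr=0.01):
    # objective_grads[i]: estimated policy gradient of objective i's return.
    # Sweeping `weights` over the simplex traces out multiple policies.
    direction = sum(w * g for w, g in zip(weights, objective_grads))
    return theta + lr * direction
```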
Multi-objective fitted Q-iteration (FQI)
Find control policies for all the linear combinations of preferences assigned to the objectives in a single training procedure
To obtain suitable representations of the preferences and improve the efficiency of MORL algorithms
- Estimation of distribution algorithms (EDA)
- Incorporate the notions in evolutionary MOO
- Acquire various strategies by a single run
- Learning classifier system
- The choice of action-selection policies can greatly affect the performance of the learning system
- α-domination strategy
- Use a goal-directed bias based on the achievement level of each evaluation
- Parallel genetic algorithm (PGA)
- Evolve a neuro-controller
- Simultaneous perturbation stochastic approximation (SPSA) was used to improve the convergence
- Adaptive margins
To obtain Pareto optimal policies in large or continuous spaces
- Multi-agent framework
- e.g., traffic signal control
- Fitted Q-iteration (FQI)
- Approximate the Pareto front
- Consistency multi-objective dynamic programming
The small scale of previous MORL test problems may not verify an algorithm's performance in dealing with a wide range of different problem settings, and algorithm implementations usually require substantial prior knowledge about the problem domain