Two algorithms were implemented to find the optimal policy, i.e. the policy that maximizes the expected score: Value Iteration and Policy Iteration.
Their implementation follows the pseudo-code shown in [1], figure 17.4, p. 653 (Value Iteration) and figure 17.7, p. 657 (Policy Iteration).
After empirical experiments, Policy Iteration was chosen as the preferred method. The reasons and the experimental data are reported in the Results Comparison section below.
Value Iteration is the first algorithm implemented. As the name suggests, it starts from an arbitrary (randomly initialized) vector of utility values and progressively updates it until every value converges. Convergence is reached when the max norm of the difference between the utility vectors before and after an update falls below a threshold denoted max_err. The value of max_err affects both the final policy and the time the algorithm takes to converge: a relatively high max_err speeds up the algorithm at the expense of a possibly suboptimal policy. Policies and execution times for different values of max_err are reported in Table 2, and Table 3 shows each policy's performance in terms of average score and standard deviation.
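A minimal sketch of this update loop is given below. It assumes a hypothetical interface in which `states` is the list of game states and `actions(s)`, `P(s1, s, a)` and `R(s, a, s1)` expose the actions, transition probabilities and rewards provided by the game class; `gamma` and `max_err` are the discount rate and threshold discussed above. It is a sketch under those assumptions, not the exact project code.

```python
def value_iteration(states, actions, P, R, gamma=1.0, max_err=1e-4):
    """Sketch of Value Iteration (cf. [1], figure 17.4).

    actions(s)  -- hypothetical: actions available in state s (assumed non-empty)
    P(s1, s, a) -- hypothetical: probability of reaching s1 from s with action a
    R(s, a, s1) -- hypothetical: reward collected on that transition
    """
    U = {s: 0.0 for s in states}                      # arbitrary initial utilities
    while True:
        U_new = {}
        for s in states:
            # Bellman update: best expected value over the available actions
            U_new[s] = max(
                sum(P(s1, s, a) * (R(s, a, s1) + gamma * U[s1]) for s1 in states)
                for a in actions(s)
            )
        # max norm of the change between two consecutive updates
        delta = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        if delta < max_err:                           # convergence criterion
            return U
```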
Policy Iteration is the second algorithm implemented. It, on the other hand, starts from a given initial policy, denoted $\pi$. The first step of the algorithm, policy evaluation, computes the utility value of each state under the current policy.
The second step of the algorithm is the policy update. The policy is updated whenever the maximum value of the Q-function for a given state is greater than the previously computed utility value for that state; the action stored in the policy for that state is then replaced with the action that produces the maximum Q-value.
The Q-function is calculated for each state; if the policy has not changed, the algorithm has converged, otherwise the two steps, policy evaluation and policy update, are repeated until convergence is reached.
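The two steps can be sketched as follows, again assuming the hypothetical `states`/`actions`/`P`/`R` interface used above. The policy-evaluation step (computing the utilities for a fixed policy) is passed in as a callable; one possible implementation, the direct linear solve, is sketched after the derivation below.

```python
def policy_iteration(states, actions, P, R, evaluate_policy, gamma=1.0):
    """Sketch of Policy Iteration (cf. [1], figure 17.7).

    evaluate_policy(policy) -- hypothetical callable returning {state: utility}
                               for the given deterministic policy
    """
    # arbitrary initial policy: pick the first available action in every state
    policy = {s: actions(s)[0] for s in states}
    while True:
        U = evaluate_policy(policy)                    # step 1: policy evaluation
        changed = False
        for s in states:                               # step 2: policy update
            def q(a):
                return sum(P(s1, s, a) * (R(s, a, s1) + gamma * U[s1])
                           for s1 in states)
            best = max(actions(s), key=q)
            # a small tolerance guards against floating-point noise
            if q(best) > U[s] + 1e-9:
                policy[s] = best
                changed = True
        if not changed:                                # policy stable: converged
            return policy, U
```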
It is worth noticing that, whereas in Value Iteration the utility values need to be calculated iteratively, in Policy Iteration the value function of each state under a given policy can be computed directly by solving a system of linear equations. For a generic policy, the utility of a state is:
$ U(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \, U(s') \right) $
Since we are dealing with deterministic policies, the first summation can be removed: in each state the agent executes the action $a = \pi(s)$ determined by the policy with probability 1, so
$ U(s) = \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \, U(s') \right) $
Suppose now that, after executing action $a$ from state $s$, the agent can end up in three different states $s'$, $s''$ and $s'''$ with probabilities $P_1$, $P_2$ and $P_3$ respectively, where
$ P_1 + P_2 + P_3 = 1 $
We can then expand the equation above as:
$ U(s) = P_1 \, R(s, a, s') + P_2 \, R(s, a, s'') + P_3 \, R(s, a, s''') + \gamma \left( P_1 \, U(s') + P_2 \, U(s'') + P_3 \, U(s''') \right) $
Collecting the equations of all states, this can be rewritten in matrix form as:
$ U = A + \gamma B U $
Where:
- $U$ is the vector of utility values, one entry per state (size $n_s \times 1$)
- $\gamma$ is the discount rate
- $B$ is the $n_s \times n_s$ matrix of transition probabilities $P(s' \mid s, \pi(s))$, which multiplies the vector of utility values of the successor states
- $A$ is a vector of size $n_s \times 1$ whose entries do not depend on the utility values

This system of linear equations can be easily solved using linear algebra:
$ U = (I - \gamma B)^{-1} A $
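As an illustration, the policy-evaluation step can be written as a direct solve of this linear system. The sketch below assumes the same hypothetical `states`/`P`/`R` interface as above and that $(I - \gamma B)$ is invertible, which holds when $\gamma < 1$ or when every policy eventually ends the game.

```python
import numpy as np

def evaluate_policy(policy, states, P, R, gamma=1.0):
    """Solve U = A + gamma * B * U directly for a fixed deterministic policy."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    A = np.zeros(n)            # expected immediate reward in each state
    B = np.zeros((n, n))       # transition probabilities under the policy
    for s in states:
        a = policy[s]
        for s1 in states:
            p = P(s1, s, a)
            A[idx[s]] += p * R(s, a, s1)
            B[idx[s], idx[s1]] = p
    U = np.linalg.solve(np.eye(n) - gamma * B, A)    # U = (I - gamma B)^(-1) A
    return {s: U[idx[s]] for s in states}
```

This helper could then be plugged into the Policy Iteration sketch above, for instance as `lambda pi: evaluate_policy(pi, states, P, R, gamma=1.0)`.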
The final value of the utility function depends on the rewards, the transition probabilities and the discount rate. While rewards and transition probabilities are fixed by the game for each state and action, the discount rate gamma can be tuned to influence the final policy.
The discount rate determines how strongly future rewards are weighted relative to immediate ones. As gamma increases, the time required for both Value Iteration and Policy Iteration to converge also increases. This is not a major concern, as for the standard game rules (3 dice with 6 faces) both algorithms converge well within the time limit of 30 seconds.
More information about the time taken by the two algorithms to converge is given in the Results Comparison section.
The algorithm is fully compatible with extended game rules, as it relies entirely on the game class to establish the number of possible states, the available actions, the rewards and the probability of ending in a state s1 given a state s0 and an action a0.
Once the optimal policy has been calculated with the Policy Iteration algorithm, it is stored in a dictionary that uses states as keys and actions as values. This allows the best action a (determined by the policy) to be retrieved quickly for any given state s.
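For illustration, a couple of entries taken from the Policy Iteration column of Table 2 would look as follows; the variable and function names are placeholders, not the project's actual identifiers.

```python
# Optimal policy stored as a dictionary: state tuples are keys, action tuples are values.
policy = {
    (1, 1, 2): (0, 1),     # entry from Table 2 (Policy Iteration column)
    (3, 4, 4): (),         # entry from Table 2 (Policy Iteration column)
}

def best_action(state):
    return policy[state]   # O(1) lookup of the action prescribed by the policy

print(best_action((1, 1, 2)))  # -> (0, 1)
```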
This section reports the results of the experiments carried out to select the algorithm used to compute the optimal policy and to choose the values of parameters such as gamma and max_err (the maximum absolute error allowed in Value Iteration between two consecutive updates).
It is also worth mentioning that all results and graphs shown below were obtained with the standard game rules (3 dice with 6 faces each). As mentioned above, the discount rate gamma can be tuned to influence the final policy; the table below reports the average score obtained with Policy Iteration for different values of gamma.
gamma | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
---|---|---|---|---|---|---|---|---|---|---|---|
average score | 10.52 | 10.44 | 10.51 | 10.48 | 10.63 | 10.75 | 10.92 | 11.24 | 11.73 | 12.32 | 13.35 |

Fig 1: Effect of the discount rate gamma on the average score (using Policy Iteration)
It is clear that the best average score is obtained for gamma = 1.
This section compares the performance and the policies produced by the Value Iteration and Policy Iteration algorithms. For both algorithms the discount rate gamma was set to 1. The table below displays the full policy obtained with the two approaches and with different values of the maximum error for Value Iteration.
Table 2 shows that Value Iteration is generally slower to converge than Policy Iteration. It also shows how the final policy is affected by the maximum error allowed: for Value Iteration, reducing the maximum error produces a better policy at the expense of the time taken to reach convergence. It is worth highlighting that Value Iteration only reaches the optimal policy for a maximum error of 1e-4 or smaller.
On the other hand, it is interesting to notice that Policy Iteration produces the optimal policy while still converging faster than any Value Iteration run that reaches the same policy.
STATE | Value Iteration (err = 1e-0) | Value Iteration (err = 1e-2) | Value Iteration (err = 1e-4) | Policy Iteration |
---|---|---|---|---|
(1, 1, 1) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(1, 1, 2) | (0, 1) | (0, 1) | (0, 1) | (0, 1) |
(1, 1, 3) | (0, 1) | (0, 1) | (0, 1) | (0, 1) |
(1, 1, 4) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(1, 1, 5) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(1, 1, 6) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(1, 2, 2) | (1, 2) | (1, 2) | (1, 2) | (1, 2) |
(1, 2, 3) | (0,) | (0,) | (0,) | (0,) |
(1, 2, 4) | (0,) | (0,) | (0,) | (0,) |
(1, 2, 5) | (0,) | (0,) | (0,) | (0,) |
(1, 2, 6) | (0,) | (0,) | (0,) | (0,) |
(1, 3, 3) | (0,) | (0,) | (0,) | (0,) |
(1, 3, 4) | (0,) | (0,) | (0,) | (0,) |
(1, 3, 5) | (0,) | (0,) | (0,) | (0,) |
(1, 3, 6) | (0,) | (0,) | (0,) | (0,) |
(1, 4, 4) | (0,) | (0,) | (0,) | (0,) |
(1, 4, 5) | (0,) | (0,) | (0,) | (0,) |
(1, 4, 6) | (0,) | (0,) | (0,) | (0,) |
(1, 5, 5) | (0,) | (0,) | (0,) | (0,) |
(1, 5, 6) | (0,) | (0,) | (0,) | (0,) |
(1, 6, 6) | (0,) | (0,) | (0,) | (0,) |
(2, 2, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(2, 2, 3) | (0, 1) | (0, 1) | (0, 1) | (0, 1) |
(2, 2, 4) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(2, 2, 5) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(2, 2, 6) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(2, 3, 3) | (0,) | (0,) | (0,) | (0,) |
(2, 3, 4) | (0,) | (0,) | (0,) | (0,) |
(2, 3, 5) | (0,) | (0,) | (0,) | (0,) |
(2, 3, 6) | (0,) | (0,) | (0,) | (0,) |
(2, 4, 4) | (0,) | (0,) | (0,) | (0,) |
(2, 4, 5) | (0,) | (0,) | (0,) | (0,) |
(2, 4, 6) | (0, 1, 2) | (0,) | (0,) | (0,) |
(2, 5, 5) | (0,) | (0,) | (0,) | (0,) |
(2, 5, 6) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(2, 6, 6) | (0,) | (0,) | (0,) | (0,) |
(3, 3, 3) | (0, 1, 2) | () | () | () |
(3, 3, 4) | (0, 1, 2) | () | () | () |
(3, 3, 5) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(3, 3, 6) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(3, 4, 4) | () | () | () | () |
(3, 4, 5) | (0, 1, 2) | () | () | () |
(3, 4, 6) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(3, 5, 5) | () | () | () | () |
(3, 5, 6) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(3, 6, 6) | (0, 1) | () | () | () |
(4, 4, 4) | () | () | () | () |
(4, 4, 5) | () | () | () | () |
(4, 4, 6) | (0, 1, 2) | () | () | () |
(4, 5, 5) | () | () | () | () |
(4, 5, 6) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) | (0, 1, 2) |
(4, 6, 6) | (0, 1) | () | () | () |
(5, 5, 5) | () | () | () | () |
(5, 5, 6) | (0, 2) | (0, 2) | () | () |
(5, 6, 6) | (0, 1) | (0, 1) | () | () |
(6, 6, 6) | () | () | () | () |
TIME TO CONVERGE (s) | 0.85 | 2.9 | 4.6 | 1.2 |
Table 2: Comparison between Value Iteration and Policy Iteration: policies and time to convergence
The table below (Table 3) reports the performance of the policies obtained with Policy Iteration and with Value Iteration for different values of the maximum absolute error. Value Iteration with err = 1e-4 is not reported separately since, as shown in Table 2, it produces the same policy as Policy Iteration. The specifics of the experiments reported in Table 3 are: number of samples = 200, sample size = 5000.
  | Value Iteration (err = 1e-0) | Value Iteration (err = 1e-2) | Policy Iteration |
---|---|---|---|
Mean | 13.2996 | 13.345 | 13.348 |
Std Deviation | 0.035 | 0.037 | 0.035 |
Table 3: Performance comparison between policies obtained with different algorithms
Fig 2: Average score distribution for the policy generated by Value Iteration (max_err = 1)
Fig 3: Average score distribution for the policy generated by Value Iteration (max_err = 0.01)
Fig 4: Average score distribution for the policy generated by Policy Iteration
[1] Russell, S. & Norvig, P. (2016). Artificial Intelligence: A Modern Approach, Global Edition. Pearson Education Limited, Harlow. Available from: ProQuest Ebook Central. [Accessed 30 September 2021].