🚂🚋🚋🚋🚋🚋

An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task

This repository includes the code for the "empirical off-policy" paper.

Table of Contents

Specification of Dependencies

This code requires python 3.5 or above. Packages that are required for running the code are all in the requirements.txt file. To install these dependencies, run the following command if your pip is set to python3.x:

pip install requirements.txt

otherwise, run:

pip3 install requirements.txt

Algorithms

Algorithms are used to find a weight vector, w, such that the dot product of w and the feature vector, approximates the value function.

Off-policy TD

Paper Off-Policy Temporal-Difference Learning with Function Approximation
Authors Doina Precup, Richard S. Sutton, Sanjoy Dasgupta

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
w += alpha * delta * z

Gradient-TD algorithms

GTD/TDC

Paper Off-Policy Temporal-Difference Learning with Function Approximation
Authors Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvàri, Eric Wiewiora

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
w += alpha * (delta * z - gamma * (1 - lmbda) * np.dot(z, v) * x_p)
v += alpha_v * (delta * z - np.dot(x, v) * x)

GTD2

Paper Off-Policy Temporal-Difference Learning with Function Approximation
Authors Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvàri, Eric Wiewiora

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
w += alpha * (np.dot(x, v) * x - gamma * (1 - lmbda) * np.dot(z, v) * x_p)
v += alpha_v * (delta * z - np.dot(x, v) * x)

HTD

Paper Investigating Practical Linear Temporal Difference Learning
Authors Adam White, Martha White

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
z_b = gamma * lmbda * z_b + x
w += alpha * ((delta * z) + (x - gamma * x_p) * np.dot((z - z_b), v))
v += alpha_v * ((delta * z) - (x - gamma * x_p) * np.dot(v, z_b))

Proximal GTD2

Paper Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity
Authors Bo Liu, Ian Gemp, Mohammad Ghavamzadeh, Ji Liu, Sridhar Mahadevan, Marek Petrik

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
v_mid = v + alpha_v * (delta * z - np.dot(x, v) * x)
w_mid = w + alpha * (np.dot(x, v) * x - (1 - lmbda) * gamma * np.dot(z, v) * x_p)
delta_mid = r + gamma * np.dot(w_mid, x_p) - np.dot(w_mid, x)
w += alpha * (np.dot(x, v_mid) * x - gamma * (1 - lmbda) * np.dot(z, v_mid) * x_p)
v += alpha_v * (delta_mid * z - np.dot(x, v_mid) * x)

TDRC

Paper Gradient Temporal-Difference Learning with Regularized Corrections
Authors Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, Martha White

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
w += alpha * (delta * z - gamma * (1 - lmbda) * np.dot(z, v) * x_p)
v += alpha_v * (delta * z - np.dot(x, v) * x) - alpha_v * tdrc_beta * v

Emphatic-TD algorithms

Emphatic TD

Paper An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
Authors Richard S. Sutton, A. Rupam Mahmood, Martha White

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
F = gamma * old_rho * F + 1
m = lmbda * 1 + (1 - lmbda) * F
z = rho * (x * m + gamma * lmbda * z)
w += alpha * delta * z

Emphatic TDβ

Paper Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis
Authors Assaf Hallak, Aviv Tamar, Remi Munos, Shie Mannor

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = rho * (gamma * lmbda * z + x)
F = beta * old_rho * F + 1
m = lmbda * 1 + (1 - lmbda) * F
z = rho * (x * m + gamma * lmbda * z)
w += alpha * delta * z

Variable-λ algorithms

Tree backup/ Tree backup for prediction

Paper Eligibility Traces for Off-Policy Policy Evaluation
Authors Doina Precup, Richard S. Sutton, Satinder Singh

The algorithm pseudo-code described below is the prediction variant of the original Tree backup algorithm proposed by Precup, Sutton, and Singh (2000). The prediction variant of the algorithm used here is first derived in the current paper.

delta = rho * (r + gamma * np.dot(w, x_p) - np.dot(w, x))
z = gamma * lmbda * old_pi * z + x
w = w + alpha * delta * z

Vtrace (simplified)

Paper [IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures] (http://proceedings.mlr.press/v80/espeholt18a/espeholt18a.pdf)
Authors Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu

delta = r + gamma * np.dot(w, x_p) - np.dot(w, x)
z = min(1, rho) * (gamma * lmbda * z + x)
w += alpha * delta * z

ABQ/ABTD

Paper Multi-step Off-policy Learning Without Importance Sampling Ratios
Authors A. Rupam Mahmood, Huizhen Yu, Richard S. Sutton

The algorithm pseudo-code described below is the prediction variant of the original Tree backup algorithm proposed by Mahmood, Sutton, and Yu (2017). The prediction variant of the algorithm used here is first derived in the current paper. This algorithm first needs to compute the following:

xi_zero = 1
xi_max = 2
xi = 2 * zeta * xi_zero + max(0, 2 * zeta - 1) * (xi_max - 2 * xi_zero)

xi_zero and xi_max are specifically computed here for the Collision problem. To see how these are computed for the task see the original paper referenced above.

nu = min(xi, 1.0 / max(pi, mu))
delta = rho * (r + gamma * np.dot(w, x_p) - np.dot(w, x))
nu = min(xi, 1.0 / max(pi, mu))
z = x + gamma * old_nu * old_pi * z
w += alpha * delta * z

Algorithm Glossary

Here, we briefly explain all the symbols and variables names that we use in our implementation.

meta-parameters

  • Common parameters of all algorithms:
    • alpha (α): is the step size that defines how much the weight vector w is updated at each time step.
    • lambda (λ): is the bootstrapping parameter.
  • Common parameters of Gradient-TD algorithms:
    • alpha_v (αv): is the second step size that defines how much the second weight vector v is updated at each time step.
  • beta (β): is the parameter used by the ETDβ algorithm that defines how much the product of importance sampling ratios from the past affects the current update.
  • tdrc_beta (tdrcβ): is the regularization parameter of the TDRC algorithms. This parameter is often set to 1.
  • zeta (ζ): is only used in the ABTD algorithm. It is similar to the bootstrapping parameter of other algorithms.

Algorithms variables

  • w: is the main weight vector being learned. init: w=0.
  • v: is the secondary weight vector learned by Gradient-TD algorithms. init: v=0.
  • z: is the eligibility trace vector. init: z=0.
  • zb: is the extra eligibility trace vector used by HTD. init: z_b=0.
  • delta (𝛿): is the td-error, which in the full bootstrapping case, is equal to the reward plus the value of the next state minus the value of the current state.
  • s: is the current state (scalar).
  • x: is the feature vector of the current state.
  • s_p: is the next state (scalar).
  • x_p: is the feature vector of the next state.
  • r: is the reward.
  • rho (ρ): is the importance sampling ratio, which is equal to the probability of taking an action under the target policy divided by the probability of taking the same action under the behavior policy.
  • old_rho (oldρ): is the importance sampling ratio at the previous time step.
  • pi (π): is the probability of taking an action under the target policy at the current time step.
  • old_pi (oldπ): is the probability of taking an action under the target policy in the previous time step. The variable π itself is the probability of taking action under the target policy at the current time step.
  • F : is the follow-on trace used by Emphatic-TD algorithms.
  • m : is the emphasis used by Emphatic-TD algorithms.
  • nu (ν): Variable used by the ABQ/ABTD algorithm. Please refer to the original paper for explanation.
  • xi (ψ): Variable used by the ABQ/ABTD algorithm. Please refer to the original paper for explanation.
  • mu (μ): is the probability of taking action under the behavior policy at the current time step.
  • old_mu (oldμ): is the probability of taking an action under the target policy at the previous time step.
  • gamma (γ): is the discount factor parameter.

Environment

At the heart of an environment is an MDP. The MDP defines the states, actions, rewards, transition probability matrix, and the discount factor.

Chain Environment and the Collision Task



An MDP with eight states is at the heart of the task. The agent starts in one of the four leftmost states with equal probability. One action in available in the four leftmost states: forward. Two actions are available in the four rightmost states: forward and turn. By taking the forward action, the agent transitions one state to the right and by taking the turn action, it moves away from the wall and transitions to one of the four leftmost states equiprobably. Rewards are all zero except for taking forward in state 8 for which a +1 is emitted. Termination function (discount factor) returns 0.9 for all transitions except for taking turn in any state or taking forward in state 8, for which the termination function returns zero.
env = Chain()
env.reset() # returns to one of the four leftmost states with equal probability.
for step in range(1, 1000):
    action = np.random.randint(0, 2) #  forward=0, turn=1
    sp, r, is_wall = env.step(action=action)
    if is_wall:
        env.reset()

We applied eleven algorithms to the Collision task: Off-policy TD(λ), GTD(λ), GTD2(λ), HTD(λ), Proximal GTD2(λ), TDRC(λ) , ETD(λ), ETD(λ,β), Tree Backup(λ), Vtrace(λ), ABTD(ζ). The target policy was π(forward|·) = 1.0. The behavior policy was b(forward|·) = 1.0 for the four leftmost states and b(forward|·) = 0.5, b(retreat|·) = 0.5 for the four rightmost states. Each algorithm was applied to the task with a range of parameters. We refer to an algorithm with a specific parameter setting as an instance of that algorithm. Each algorithm instance was applied to the Collision task for 20,000 time steps, which we call a run. We repeated the 20,000 time steps for fifty runs. All instances of all algorithms experienced the same fifty trajectories.

Linear function approximation was used to approximate the true value function. Each state was represented by a six dimensional binary feature vector. The feature representation of each state had exactly three zeros and three ones. The locations of the zeros and ones were selected randomly. This was repeated once at the beginning of each run, meaning that the representation for each run is most probably different from other runs. At the beginning of each run we set w0 = 0 and thus the error would be the same for all algorithms at the beginning of the runs.

Feature representation

The feature representation for the collision task is an array of size 8, 6, 50, where 8 corresponds to the number of states, 6 correponds to the number of features for each state, and 50 corresponds to the number of runs. The feature representations used for the set of results presented here and in the paper is saved in:

Resources/EightStateCollision/feature_rep.npy

Note that the feature representaiton for each run is different in the Collision task. For example, the feature representation for the first run is:

array([[0., 0., 1., 0., 1., 1.],
       [1., 1., 1., 0., 0., 0.],
       [0., 1., 1., 0., 0., 1.],
       [1., 0., 1., 1., 0., 0.],
       [1., 1., 0., 0., 1., 0.],
       [0., 1., 1., 1., 0., 0.],
       [1., 1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 0., 1.]])

State distribution induced by the behavior policy

To compute an approximation of the mean squared value error at each time step, weighting induced by the behavior policy was approximated by following the behavior policy for 20,000,000 time step and computing the fraction of time spent in each state. The resulting distribution is saved in:

Resources/EightStateCollision/d_mu.npy

d_mu.npy is a one dimensional numpy array of size 8:

array([0.05715078, 0.1142799 , 0.17142456, 0.22856842, 0.22856842, 0.11428067, 0.05715311, 0.02857415])

True state values

To compute an approximation of the mean squared value error at each time step, we need the true state values. Luckily, for the Collision task, these values are easy to compute. We computed these true values by following the target policy from each state to the wall once. The resulting values are saved in:

Resources/EightStateCollision/state_values.npy

state_values.npy is a one dimensional numpy array of size 8:

array([0.4782969, 0.531441, 0.59049, 0.6561, 0.729, 0.81, 0.9, 1])

How to Run the Code

The code can be run in two different ways. One way is through learning.py that can be used to run small experiments on a local computer. The other way is through the files inside the Job directory. We explain each of these approaches below by means of an example.

Running on Your Local Machine

Let's take the following example: applying Off-policy TD(λ) to the Collision task. There are multiple ways for doing this. The first way is to open a terminal and go into the root directory of the code and run Learning.py with proper parameters:

python3 Learning.py --algorithm TD --task EightStateCollision --num_of_runs 50 --num_steps --environment Chain
--save_value_function Ture --alpha 0.01 --lmbda 0.9

In case any of the parameters are not specified, a default value will be used. The default value is set in the Job directory, inside the JobBuilder.py file. This means, the code, can alternatively be run, by setting all the necessary values that an algorithm needs at the top of the JobBuilder.py file. Note that not all parameters specified in the default_params dict are required for all algorithms. For example, the tdrc_beta parameter is only required to be set for the TDRC(λ) algorithms. Once the variables inside the default_params dictionary, the code can be run:

python3 Learning.py

Or one can choose to specify some parameters in the default_params dictionary and specify the rest as command line argumets like the following:

python3 Learning.py --algorithm TD --task EightStateCollision --alpha 0.01

Running on Servers with Slurm Workload Managers

When parameter sweeps are necessary, the code can be run on supercomputers. The current code supports running on servers that use slurm workload managers such as compute canada. For exampole, to apply the TD algorithm to the Collision (EightStateCollision) task, with various parameters, first you need to create a json file that specifies all the parameters that you would like to run, for example:

{
  "agent": "TD",
  "environment": "Chain",
  "task": "EightStateCollision",
  "number_of_runs": 50,
  "number_of_steps": 20000,
  "sub_sample": 1,
  "meta_parameters": {
    "alpha": [
      0.000003814, 0.000007629, 0.000015258, 0.000030517, 0.000061035, 0.000122070, 0.000244140, 0.000488281,
      0.000976562, 0.001953125, 0.00390625, 0.0078125, 0.015625, 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0
    ],
    "lmbda": [
      0.1, 0.2, 0.3
    ]
  }
}

and then run main.py using python:

python3 main.py -f <path_to_the_json_file> -s <kind_of_submission>

where kind_of_submission refers to one of the two ways you can submit your code:

  1. You can request an individual cpu for each of the algorithm instances, where an algorithm instance refers to an algorithm with specific parameters. To request an individual cpu, run the following command:
python3 main.py -f <path_to_the_json_file_or_dir> -s cpu

When running each algorithm instance on a single cpu, you need to specify the following parameters inside Job/SubmitJobsTemplatesCedar.SL:

#SBATCH --account=xxx
#SBATCH --time=00:15:58
#SBATCH --mem=3G

where #SBATCH --account=xxx requires the account you are using in place of xxx, #SBATCH --time=00:15:58 requires the time you want to request for each individual cpu, and #SBATCH --mem=xG requires the amount of memory in place of x.

  1. You can request a node, that we assume includes 40 cpus. If you request a node, the jobs you submit will run in parallel 40 at a time, and once one job is finished, the next one in line will start running. This process continues until either all jobs are finished running, or you run out of the time you requested for that node.
python3 main.py -f <path_to_the_json_file_or_dir> -s node

When running the jobs on nodes, you need to specify the following parameters inside Job/SubmitJobsTemplates.SL:

#SBATCH --account=xxx
#SBATCH --time=11:58:59
#SBATCH --nodes=x
#SBATCH --ntasks-per-node=40

where #SBATCH --account=xxx requires the account you are using in place of xxx, #SBATCH --time=11:58:59 requires the time you want to request for each individual node, each of which includes 40 cpus in this case, and #SBATCH --nodes=x requires the number of nodes you would like to request in place of x. If you request more than one node, your jobs will be spread across nodes, 40 on each node, and once each job finishes, the next job in the queue will start running. #SBATCH --ntasks-per-node=xx is the number of jobs you would like to run concurrently on a single node. In this case, for example, we set it to 40.

If path_to_the_json_file_or_dir is a directory, then the code will walk into all the subdirectories, and submits jobs for all the parameters in the json files that it finds inside those directories sequentially. If path_to_the_json_file_or_dir is a file, then the code will submit jobs for all the parameters that it finds inside that single json file. Note that you can create a new directory for each experiment that you would like to run, and create directories for each of the algorithms you would like to run in that experiment. For example, we created a directory called FirstChain inside the Experiments directory and created one directory per algorithm inside the FirstChain directory for each of the algorithms and specified a json file in that directory. It is worth noting that whatever parameter that is not specified in the json file will be read from the default_params dictionary inside the Job directory inside the JobBuilder.py file.

Plotting the results

The following table shows all the parameters that we tested in the experiments:

We now explain how each figure in the paper can be reproduced. All the figures of the paper can be reproduced using the plot_data.py file, once you run the Learning.py script with all the needed parameters. If you do not have the results available, the plot_data.py script will return an error.

  1. Processing the data: This script manipulates data in a way that it is ready to be plotted over step sizes and also such that the data is ready to be plotted as learning curves averaged over runs. The process_data script also re-runs the algorithms with their best parameters to eliminate possible maximization bias, as explained in the paper. This is a time consuming step. If you do not like to do this step, simply set:

    PLOT_RERUN = False

    in Plotting/plot_params.py and the code will ignore the re-running steps. If you would like to eliminate the maximization bias, set:

    PLOT_RERUN = True

    Finally, go to plot_data.py and set func_to_run = 'process_data', and run the plot_data.py script.

  2. Plotting the learned value functions: Go to plot_data, and set func_to_run = 'plot_value_functions' to plot the learned value functions for some of the runs, and set func_to_run = plot_all_final_value_functions to plot the value function learned by the last time step of all of the runs in one plot.


  3. Plotting the learning curves with specific parameter values: Go to plot_data, and set func_to_run = 'specific_learning_curves_full_bootstrap', and run the plot_data.py script.


  4. Plotting the parameter studies for step size for all algorithms: Go to plot_data, and set func_to_run = 'collision_sensitivity_curves_for_many_lambdas', and run the script.


  5. Plotting the parameter sensitivity study of Emphatic-TD algorithms: Go to plot_data, and set func_to_run = 'collision_emphatics_sensitivity_full_bootstrap', and run the script.


  6. Plotting the parameter sensitivity study of Gradient-TD algorithms: Go to plot_data, and set func_to_run = 'collision_gradients_sensitivity_full_bootstrap', and run the script.