Reproduction of experiment 3.1.2 on meta-learning described in the paper "Learning to Reinforcement Learn"
This repository works in a similar way to https://github.com/ThomasLecat/learning-to-learn-1. The main difference is the adaptation of the code to the eleven-armed bandit environment used in this experiment.
The environment required for this experiment can be found at https://github.com/ThomasLecat/gym-bandit-environments.git and is named "BanditElevenArmedWithIndex-v0".
In this setup, each bandit MDP has 11 arms which always give a reward. Among the first ten arms, nine "non-target" arms give a reward of 1.1, and one "target" arm gives a reward of 5. The eleventh arm is an "informative" arm whose reward is worth a tenth of the index of the target arm. For example, if the reward of the informative arm is 0.1, then the target arm is the first arm; if it is 0.2, then the target arm is the second arm, etc. Each trial is 5 steps long, so the optimal strategy is to pull the informative arm first, infer the index of the target arm from the reward obtained, then pull this arm for the last four steps.
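As a purely illustrative sketch of this reward structure and of the optimal strategy (the class and variable names below are made up; the real environment lives in the gym-bandit-environments repository):

```python
import numpy as np

class ElevenArmedBanditSketch:
    """Rough sketch of the reward structure described above; the actual
    environment is BanditElevenArmedWithIndex-v0 from gym-bandit-environments."""

    def __init__(self, rng=None):
        rng = rng or np.random
        self.target = rng.randint(10)      # the target arm is one of the first ten arms

    def pull(self, arm):
        if arm == 10:                      # eleventh arm: the informative arm
            return (self.target + 1) / 10.0
        if arm == self.target:             # target arm
            return 5.0
        return 1.1                         # any other arm: non-target reward

# Optimal strategy over one 5-step trial:
bandit = ElevenArmedBanditSketch()
rewards = [bandit.pull(10)]                         # step 1: pull the informative arm
target = int(round(rewards[0] * 10)) - 1            # infer the target index from that reward
rewards += [bandit.pull(target) for _ in range(4)]  # steps 2-5: pull the target arm
print(rewards)                                      # e.g. [0.4, 5.0, 5.0, 5.0, 5.0]
```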
The purpose of the experiment is to study whether the agent is capable of learning the link between the eleventh arm's reward value and the index of the optimal arm.
Bandit environments are stateless, but the training is organised in fake episodes during which the internal state of the LSTM is kept. The length of a fake episode is 5 trials in this experiment, but it can be set to a different value with the parameter -n (see the "How to" section).
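Conceptually, the training is organised as in the following sketch (the names and the stubbed recurrent state are illustrative only; the actual loop lives in A3C.py and worker.py): the recurrent state is carried across the trials of one fake episode and reset when the next episode starts.

```python
import numpy as np

def run_fake_episodes(num_episodes=3, num_trials=5, resample_env=True, seed=0):
    """Rough sketch of the episode structure only, not the repository's training code."""
    rng = np.random.RandomState(seed)
    env_target = rng.randint(10)                  # target-arm index of the current MDP
    for episode in range(num_episodes):
        if resample_env:                          # what the -m flag does: a new MDP each episode
            env_target = rng.randint(10)
        lstm_state = np.zeros(48)                 # recurrent state reset once per fake episode
        for trial in range(num_trials):           # -n trials per fake episode (default 5)
            action = rng.randint(11)              # stand-in for the learned policy
            reward = 5.0 if action == env_target else 1.1   # simplified reward, see sketch above
            lstm_state = np.tanh(lstm_state + reward)       # stand-in for the LSTM update

run_fake_episodes()
```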
Please refer to the original paper for a detailed description of the experiment.
As in the learning-to-learn-1 repository, the input to the agent's network when using bandit environments is the last action, the last reward and the timestep in the episode, stacked together.
Both the last action and the last reward are translated into one-hot vectors before being fed to the network. The code in this repository is specifically designed to handle this translation, which makes it unusable with bandit environments other than BanditElevenArmedWithIndex-v0.
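As a purely illustrative sketch of what this input could look like (the helper names, the exact ordering and the reward-to-index mapping below are assumptions, not the repository's actual code):

```python
import numpy as np

NUM_ARMS = 11
# Possible reward values in BanditElevenArmedWithIndex-v0: the ten informative-arm
# values, the non-target reward and the target reward (mapping assumed for illustration).
REWARD_VALUES = [round(0.1 * i, 1) for i in range(1, 11)] + [1.1, 5.0]

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def make_input(last_action, last_reward, timestep):
    """Stack one-hot(last action), one-hot(last reward) and the timestep into one vector."""
    action_vec = one_hot(last_action, NUM_ARMS)
    reward_vec = one_hot(REWARD_VALUES.index(last_reward), len(REWARD_VALUES))
    return np.concatenate([action_vec, reward_vec, [float(timestep)]])

x = make_input(last_action=10, last_reward=0.3, timestep=1)
print(x.shape)  # (24,) = 11 (action) + 12 (reward) + 1 (timestep)
```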
The two callable scripts are train.py and test.py. Both take the same arguments as input. Among them, we can find:
- -w: number of workers working in parallel
- -m: resample environments at the beginning of each episode
- -n: number of trials in each episode (default is 5)
- -lr: learning rate (default is 1e-4)
Meta-learning is performed whenever the -m argument is present. In that case, the environment is recreated at the beginning of each episode. As the index of the target arm is sampled randomly, the configuration of the MDP changes from one episode to the next. This results in training (and/or testing) the agent on a set of MDPs instead of a single one.
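For example, a meta-learning training run with two workers and the default episode length could be launched as follows (the exact flag combination is given for illustration; check python train.py --help for the authoritative list):

python train.py --num-workers 2 -m -n 5 -lr 1e-4 --env-id BanditElevenArmedWithIndex-v0 --log-dir /tmp/banditElevenMeta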
The hyperparameters are somewhat spread across the code. Here is the location of some of them:
- number of training steps: num_global_step in the run function of worker.py
- number of testing steps: num_test_step, next to num_global_step, in worker_test.py
- discount factor: in file A3C.py, class A3C, method process, change the gamma value in the line "batch = process_rollout(rollout, gamma=0.99, lambda_=1.0)" (see the sketch after this list)
- number of steps in each rollout (t_max in the original A3C paper): in file A3C.py, class A3C, init method, change the value in the line "num_local_step = 5"
- learning rate: change by adding the -lr argument when calling python train.py (see the section above)
- number of trials in a fake episode for bandit environments: change by adding the -n argument when calling python train.py (see the section above)
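For reference, the gamma and lambda_ values above enter a generalized-advantage-estimation style computation roughly like the following sketch (a simplified stand-in for process_rollout; the actual implementation in A3C.py may differ in details):

```python
import numpy as np

def discounted_returns_sketch(rewards, values, bootstrap_value, gamma=0.99, lambda_=1.0):
    """Simplified sketch of what a process_rollout-style function computes:
    discounted returns and GAE advantages for one rollout."""
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(list(values) + [bootstrap_value], dtype=np.float32)
    returns = np.zeros_like(rewards)
    advantages = np.zeros_like(rewards)
    running_return = bootstrap_value
    running_adv = 0.0
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return          # discounted return
        delta = rewards[t] + gamma * values[t + 1] - values[t]        # TD error
        running_adv = delta + gamma * lambda_ * running_adv           # GAE advantage
        returns[t] = running_return
        advantages[t] = running_adv
    return returns, advantages

# With gamma=0.99 and lambda_=1.0 (the defaults quoted above), a 5-step rollout:
print(discounted_returns_sketch([0.1, 5.0, 5.0, 5.0, 5.0], [1.0] * 5, 0.0))
```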
The dependencies are the following:
- Python 2.7 or 3.5
- Golang
- six (for py2/3 compatibility)
- TensorFlow 0.12
- tmux (the start script opens up a tmux session with multiple windows)
- htop (shown in one of the tmux windows)
- gym
- gym[atari]
- libjpeg-turbo (brew install libjpeg-turbo)
- universe
- opencv-python
- numpy
- scipy
To set up a conda environment with these dependencies:
conda create --name learning-to-learn-2 python=3.5
source activate learning-to-learn-2
brew install tmux htop cmake golang libjpeg-turbo # On Linux use sudo apt-get install -y tmux htop cmake golang libjpeg-dev
pip install "gym[atari]"
pip install universe
pip install six
pip install tensorflow # the dependency list above specifies TensorFlow 0.12
conda install -y -c https://conda.binstar.org/menpo opencv3
conda install -y numpy
conda install -y scipy
Add the following to your .bashrc so that you'll have the correct environment when the train.py script spawns new bash shells:
source activate learning-to-learn-2
Training can then be launched with:
python train.py --num-workers 2 --env-id BanditElevenArmedWithIndex-v0 --log-dir /tmp/banditEleven
The code will launch the following processes:
- worker-0 - a process that runs policy gradient
- worker-1 - a process identical to worker-0 that uses different random noise from the environment
- ps - the parameter server, which synchronizes the parameters among the different workers
- tb - a TensorBoard process for convenient display of the learning statistics