elite_buffer_vtrace

Distributed RL platform with a modified IMPALA architecture. Implements the CLEAR and LASER V-trace modifications along with the Attentive and Elite sampling experience replay methods.


PyTorch distributed RL platform

General distributed RL platform based on a modified IMPALA architecture, written in pure Python (>=3.8). It provides features for training/testing/evaluating distributed RL agents on a single-node computational instance. Multi-node scaling is not supported.

Architecture

Platform presently supports these agents:

  • V-trace (IMPALA), with the CLEAR and LASER modifications

For mixing on/off-policy data these replay buffer methods are supported:

  • Queue (pass-through buffer, no replay)
  • Standard (uniform) experience replay
  • Attentive experience replay
  • Elite experience replay

Environment support:

  • ALE environments

Prerequisites

Before using the RL platform, install all the required dependencies. The application is implemented in Python 3.8 and PyTorch 1.8.2, so it can run on any OS with a Python 3 interpreter. We recommend using Python virtual environments for building an execution environment manually, or Docker for automatic deployment. Required Python modules are listed in requirements.txt located in the top folder of the project alongside main.py – beware that some sub-dependencies may not be listed.

Installing requirements

pip install -r requirements.txt

Downloading ALE environments

wget http://www.atarimania.com/roms/Roms.rar
sudo apt install unrar 
unrar e Roms.rar
unzip ROMS.zip
ale-import-roms ROMS

Running the agent

Before starting any training, it is advantageous to study the file option_flags.py, which contains all application (hyper)parameters and their default values. The entry point of the application is located inside main.py. Each execution is uniquely identified – within the computational instance – by the agent's environment name + UNIX timestamp. All files related to a specific execution, regardless of their purpose, are stored inside the folder <project>/results/<environment_name>_<unix_timestamp>.
A user can safely interrupt the application with a SIGINT signal (CTRL+C in the terminal); training progress will be safely stored before termination. The current implementation only supports environments from ALE with the explicit suffix NoFrameskip-v4. Frame skipping is handled by a custom environment pre-processing wrapper, and usage of sticky actions has not been tested yet – therefore, it is not supported. Presently, the RL platform only supports the V-trace RL algorithm and can be operated in 3 modes – new training, testing, and training from a checkpoint.
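As an illustration of the naming scheme described above, the results directory for a run could be resolved roughly like this (a minimal sketch; the helper function is hypothetical, not part of the project):

import os
import time

def results_dir_for_run(project_root, env_name):
    # Each execution is identified by <environment_name>_<unix_timestamp>.
    run_id = f"{env_name}_{int(time.time())}"
    return os.path.join(project_root, "results", run_id)

# e.g. ./results/PongNoFrameskip-v4_1700000000
print(results_dir_for_run(".", "PongNoFrameskip-v4"))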

Start new training

python main.py --op_mode=train \
--env=PongNoFrameskip-v4  \
--environment_max_steps=1000000 \
--batch_size=10 \
--r_f_steps=10 \
--worker_count=2 \
--envs_per_worker=2 \
--replay_parameters='[{"type": "queue", "capacity": 1, "sample_ratio": 0.5}, {"type": "standard", "capacity": 1000, "sample_ratio": 0.5}]'

Continue training from checkpoint

python main.py --op_mode=train_w_load \
--environment_max_steps=1000000 \
--load_model_url=<path_to_model>

Test trained agent

python main.py --op_mode=test \
--test_episode_count=1000 \
--load_model_url=<path_to_model> \
--render=True

Multiple experiments can be executed in sequence using a Python loop in the file multi_train.py, or a custom loop in a terminal script (e.g., a Bash script) applied to the standard application entry point in main.py. The order of precedence of the application arguments is as follows:

  1. Standard application arguments (argv[1:])
  2. Additional arguments passed to application from multi_train.py with change_args function
  3. Arguments loaded from saved checkpoint file
  4. Default argument values stored inside option_flags.py

Values of arguments with higher priority overwrite those with lower priority.
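This precedence can be pictured as a simple dictionary merge from the lowest- to the highest-priority source (a minimal sketch only; the platform's actual argument handling in option_flags.py and main.py may differ):

def resolve_args(defaults, checkpoint_args=None, multi_train_args=None, cli_args=None):
    # Start from the lowest-priority source and let every higher-priority
    # source overwrite the keys it defines.
    resolved = dict(defaults)
    for source in (checkpoint_args, multi_train_args, cli_args):
        if source:
            resolved.update(source)
    return resolved

# Example: the command line wins over the checkpoint, which wins over the defaults.
print(resolve_args(
    defaults={"batch_size": 10, "r_f_steps": 10},
    checkpoint_args={"batch_size": 32},
    cli_args={"batch_size": 64},
))
# -> {'batch_size': 64, 'r_f_steps': 10}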

Another important thing to note is that each training needs at least one replay object, and the sum of the sample_ratios of all used replays must be 1. The sample ratio dictates the proportion of samples taken from each replay to form a batch. If we don't want to use a replay and only want to pass experiences as they are being generated, we can use a queue with size 1: replay_parameters='[{"type": "queue", "capacity": 1, "sample_ratio": 1}]'
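A rough sketch of how this constraint translates into per-replay sample counts for one batch (the function split_batch is hypothetical and only illustrative):

import json
import math

def split_batch(replay_parameters, batch_size):
    replays = json.loads(replay_parameters)
    total_ratio = sum(r["sample_ratio"] for r in replays)
    # At least one replay object is required and the sample_ratios must sum to 1.
    assert replays and math.isclose(total_ratio, 1.0), "sample_ratios must sum to 1"
    # Each replay contributes its proportional share of the batch.
    return [(r["type"], round(batch_size * r["sample_ratio"])) for r in replays]

print(split_batch(
    '[{"type": "queue", "capacity": 1, "sample_ratio": 0.5},'
    ' {"type": "standard", "capacity": 1000, "sample_ratio": 0.5}]',
    batch_size=10,
))
# -> [('queue', 5), ('standard', 5)]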

Todo

  • Multi-learner architecture
  • Adaptive asynchronous batching (caching)
  • Support other environment collections such as MuJoCo and DeepMind Lab
  • Implement a PPO-based distributed algorithm, i.e., IMPACT
  • System for periodically saving performance metric values into text files in chunks
  • Custom testing worker used solely for collecting performance metric values by following the current policy
  • Multi-GPU support
  • Implement a graphical user interface (GUI) for monitoring training progress and hardware utilization

References

[1]"IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures," Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1407-1416, 2018

[2]"Experience Replay for Continual Learning," Advances in Neural Information Processing Systems, p. 32, 2019

[3]"Off-Policy Actor-Critic with Shared Experience Replay," Proceedings of the 37th International Conference on Machine Learning, 2020

[4]"Attentive Experience Replay," Proceedings of the AAAI Conference on Artificial Intelligence 34, pp. 5900-5907, 03 04 2020

[5]"IMPACT: Importance Weighted Asynchronous Architectures with Clipped Target Networks," Proceedings of the 8th International Conference on Learning Representations, 2020

Elite experience replay

Elite experience replay is a replay buffer method that utilizes the elite sampling technique, which uses an estimate of the "off-policyness" of n-step state transitions to prioritize the samples selected from the replay and thereby increase the overall sample efficiency of the RL algorithm. Elite sampling calculates the similarity between the same state sequence encoded with the behavioral policy (encoded when the sequence is generated by a worker) and with the target policy. States encoded into several values using the policy NN model are referred to as state feature vectors.
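Conceptually, the off-policy distance of a stored trajectory can be pictured as a norm between the feature vectors recorded when the trajectory was generated and those produced by re-encoding the same states with the current (target) policy network. A minimal PyTorch sketch under that assumption (tensor shapes and the exact distance function are illustrative; the project's dist_function options may differ):

import torch

def off_policy_distance(behavior_features, target_features, p=2):
    # behavior_features: [n_steps, feature_dim], recorded by the worker.
    # target_features:   [n_steps, feature_dim], re-encoded by the learner's
    #                    current policy network for the same states.
    per_step = torch.norm(target_features - behavior_features, p=p, dim=-1)
    # One scalar per trajectory: the mean per-step distance.
    return per_step.mean()

behavior = torch.randn(10, 64)
target = behavior + 0.1 * torch.randn(10, 64)   # slightly "off-policy"
print(off_policy_distance(behavior, target))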

We have tested elite experience replay in combination with the V-trace agent on several environments from ALE and compared its performance to an agent with a standard replay buffer. Our experiments showed that elite sampling improves the agent's performance over uniform sampling in the highly policy-volatile parts of the training process. Furthermore, the decrease in the agent's training speed caused by the computation of the feature vector distance metric can be partially counteracted by preparing training batches pre-emptively in the background with caching.
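The caching mentioned above can be pictured as a small background batch prefetcher, along these lines (a minimal sketch; the function name and cache size are assumptions, not the project's implementation):

import queue
import threading

def start_batch_prefetcher(sample_batch, cache_size=4):
    # Prepare training batches in a background thread so the learner does not
    # wait for the (comparatively slow) elite-sampling distance computation.
    cache = queue.Queue(maxsize=cache_size)

    def worker():
        while True:
            cache.put(sample_batch())  # blocks while the cache is full

    threading.Thread(target=worker, daemon=True).start()
    return cache

# Usage: the learner calls cache.get() whenever it needs the next ready batch.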

(Figures: elite replay architecture diagrams; training curves on Breakout and Seaquest.)

Implemented elite sampling strategies (sampling is executed only on a small random subset of the replay, whose size is determined by the batch size and batch multiplier hyperparameters); a simplified sketch of strategies 1 and 2 follows the list:

  1. Pick a batch-size number of samples with the lowest off-policy distance metric.
  2. Sort the samples by the off-policy distance metric, then divide them into a batch-size number of subsets. From each subset, pick the trajectory with the lowest off-policy distance metric.
  3. Same as 1 with the addition that we prioritize those samples that have been sampled the least.
  4. Same as 2 with the addition that we prioritize those samples that have been sampled the least.
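A simplified sketch of strategies 1 and 2 (the function and its arguments are illustrative only; the least-sampled prioritization of strategies 3 and 4 is omitted):

import random

def elite_sample(replay, distances, batch_size, lambda_batch_multiplier, strategy=1):
    # Sampling is restricted to a small random subset of the replay,
    # batch_size * lambda_batch_multiplier trajectories large.
    subset = random.sample(range(len(replay)), batch_size * lambda_batch_multiplier)
    # Order the candidates by their off-policy distance (lowest first).
    ordered = sorted(subset, key=lambda i: distances[i])
    if strategy == 1:
        # Strategy 1: take the batch_size candidates with the lowest distance.
        picked = ordered[:batch_size]
    else:
        # Strategy 2: split the ordered candidates into batch_size groups and
        # take the lowest-distance candidate from each group.
        group = len(ordered) // batch_size
        picked = [ordered[g * group] for g in range(batch_size)]
    return [replay[i] for i in picked]

# Toy usage: 100 stored trajectories with random off-policy distances.
buffer = list(range(100))
dists = [random.random() for _ in buffer]
print(elite_sample(buffer, dists, batch_size=4, lambda_batch_multiplier=6, strategy=2))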

Running elite sampling agent

python main.py --op_mode=train \
--env=PongNoFrameskip-v4  \
--environment_max_steps=1000000 \
--replay_parameters='[{"type": "custom", "capacity": 1000, "sample_ratio": 0.5, "dist_function":"ln_norm", "sample_strategy":"elite_sampling", "lambda_batch_multiplier":6, "alfa_annealing_factor":2.0} ,{"type": "queue", "capacity": 1, "sample_ratio": 0.5}]'