/reinforced_scinet

Operationally meaningful representations of reinforcement learning environments.

Primary language: Python. License: Apache License 2.0.

Reinforced SciNet

This is reinforced-scinet, learning operationally meaningful representations of reinforcement learning (RL) environments.

This code accompanies our paper,

H. Poulsen Nautrup, T. Metger, R. Iten, S. Jerbi, L.M. Trenkwalder, H. Wilming, H.J. Briegel and R. Renner
Operationally meaningful representations of physical systems in neural networks,
arXiv:2001.00593 [quant-ph] (2020).

If you use any of the code, please cite our paper. Here, we provide the PyTorch code for the examples from Sec. 6 of the paper and beyond. In particular, this repository contains:

  1. Code for asynchronous RL with deep energy-based projective simulation models.
  2. Code for asynchronous representation learning as described in the paper.
  3. Trained models along with detailed results from the evaluation of the model.

The code for the examples that do not involve RL can be found here.

Enjoy!

Requirements

In order to run the code you will require the following:

  • python 3.7.4
  • numpy 1.17.2
  • torch 1.2.0
  • matplotlib 3.1.1
  • gym-subgridworld 0.0.2 from here

You may run the code with different versions, but these are the versions we have verified the code with. We also recommend working with a GPU, since we have not thoroughly tested the code on CPUs only.
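To see quickly how a local environment compares with the versions listed above, a small check like the following may help. It is only a sketch: the helper names are made up for illustration, the distribution name gym-subgridworld is assumed to match the list entry, and importlib.metadata is only in the standard library from Python 3.8 (on 3.7 the importlib_metadata backport offers the same interface).

```python
try:
    from importlib import metadata            # standard library on Python 3.8+
except ImportError:                            # Python 3.7: use the backport
    import importlib_metadata as metadata

# Versions the README reports as verified.
PINNED = {
    "numpy": "1.17.2",
    "torch": "1.2.0",
    "matplotlib": "3.1.1",
    "gym-subgridworld": "0.0.2",   # assumed distribution name
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(pinned, lookup=installed_version):
    """Map each package name to 'ok', 'missing', or a 'differs ...' note."""
    report = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
        elif found == wanted:
            report[name] = "ok"
        else:
            report[name] = f"differs (found {found}, verified {wanted})"
    return report


if __name__ == "__main__":
    for package, status in compare_versions(PINNED).items():
        print(f"{package:20s} {status}")
```

A "differs" entry is not necessarily a problem, since other versions may work; it only marks where the local setup departs from the verified one.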

There is quite a lot of documentation in the code already. For further questions, please contact us directly.

Architecture

The asynchronous architecture is inspired by GA3C. In policy mode we train various deep energy-based projective simulation (DPS) models on the same environment but with different objectives. In selection mode we use neural networks to encode and decode observations received from an RL environment. The decoders either predict the behavior of trained RL agents or reproduce the observation like an autoencoder. The architecture is designed to be asynchronous and may make use of a GPU. A more detailed description can be found in the paper. The specific architecture can be illustrated as follows:

[Figure: Asynchronous RL]
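The GA3C-style flow of data between workers, predictors and trainers can be sketched with plain queues. This is an illustration of the general pattern only, not the repository's implementation: it uses threads, a placeholder policy, and invented names (run_demo, prediction_q, training_q), whereas the actual code runs neural network models and most likely uses separate processes.

```python
import queue
import threading


def run_demo(num_workers=2, steps_per_worker=3):
    """Run a toy worker/predictor/trainer pipeline and return the experiences."""
    prediction_q = queue.Queue()    # workers -> predictor (observations)
    training_q = queue.Queue()      # workers -> trainer (experiences)
    reply_qs = [queue.Queue() for _ in range(num_workers)]  # predictor -> worker
    trained = []                    # experiences consumed by the trainer

    def predictor():
        # Answers action requests; a real predictor would batch them for a model.
        while True:
            item = prediction_q.get()
            if item is None:        # sentinel: shut down
                break
            worker_id, observation = item
            action = observation % 2            # placeholder policy
            reply_qs[worker_id].put(action)

    def trainer():
        # Consumes experiences; a real trainer would update the model here.
        while True:
            item = training_q.get()
            if item is None:
                break
            trained.append(item)

    def worker(worker_id):
        # Interacts with a (fake) environment and never touches the model itself.
        for step in range(steps_per_worker):
            observation = worker_id * 100 + step
            prediction_q.put((worker_id, observation))
            action = reply_qs[worker_id].get()
            training_q.put((worker_id, observation, action))

    services = [threading.Thread(target=predictor),
                threading.Thread(target=trainer)]
    workers = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for thread in services + workers:
        thread.start()
    for thread in workers:
        thread.join()
    prediction_q.put(None)          # stop the service threads
    training_q.put(None)
    for thread in services:
        thread.join()
    return trained


if __name__ == "__main__":
    print(len(run_demo()), "experiences reached the trainer")
```

The point of the pattern is that workers only exchange messages through queues, so the model can live in one place (for instance on a GPU) while many environments run concurrently.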

Get started

You can immediately run the code to predict the behavior of three trained deep reinforcement learning models. To this end, you just need to run the main.py file with python and watch the results being logged to your console. (You might have to move the content of data/publication/ to data/ so it can be loaded.)

Chances are that your local computer cannot run the code with the same parameters that we used. You can decrease the workload by reducing the number of processes in config.py. That is, you need to lower the values of WORKERS, TRAINERS and PREDICTORS. For example, you can try 8 workers, 1 trainer and 1 predictor for each environment ID. If you cannot use a GPU, you can change the DEVICE parameter to cpu. However, we have not thoroughly tested this.
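A reduced setting along these lines might look as follows. The parameter names come from the text above, but the layout is an assumption: the actual config.py may store these values differently (for example per environment ID or as class attributes), so adapt the existing entries rather than pasting this in.

```python
# Hypothetical excerpt illustrating a lighter configuration.
# Only the names WORKERS, TRAINERS, PREDICTORS and DEVICE are taken from the
# README; the plain-integer/string layout is assumed.
WORKERS = 8        # environment workers per environment ID
TRAINERS = 1       # trainer processes per environment ID
PREDICTORS = 1     # predictor processes per environment ID
DEVICE = "cpu"     # use the CPU when no GPU is available (not thoroughly tested)
```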

Once you run the program, you should see numbers like the following:

[ Training #       34400 ][ Episode #     341 ][ Loss for type: Policy ][ Trainer for type:  env1 ][ Loss: 1.20120335. ]
[ Training #       34400 ][ Episode #     342 ][ Loss for type: Autoencoder ][ Trainer for type:  env1 ][ Loss: 0.29708305. ]

What you see is the training of the prediction agents and an autoencoder. Given an environment objective (here env1), the policy prediction has a smooth L1 loss of about 1.20. The autoencoder, which tries to reproduce the input, has a binary cross-entropy loss of about 0.297. Over time, these losses decrease while the selection neurons start adding noise to the latent representation. A log of the amount of noise is written to results_log/selection.txt. There you find results like this:
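If you redirect the console output to a file, the loss lines can be pulled apart with a regular expression, for example to plot the loss per decoder type. The sketch below assumes every loss line has exactly the shape of the two sample lines shown earlier; the function name parse_log_line is made up for illustration.

```python
import re

# Pattern modelled on the sample console lines; the trailing "." after the
# loss value is part of the logged format and is not part of the number.
LOG_PATTERN = re.compile(
    r"\[ Training #\s*(?P<training>\d+) \]"
    r"\[ Episode #\s*(?P<episode>\d+) \]"
    r"\[ Loss for type: (?P<loss_type>\w+) \]"
    r"\[ Trainer for type:\s*(?P<trainer>\w+) \]"
    r"\[ Loss: (?P<loss>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\.? \]"
)


def parse_log_line(line):
    """Return the fields of one loss line as a dict, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "training": int(match.group("training")),
        "episode": int(match.group("episode")),
        "loss_type": match.group("loss_type"),
        "trainer": match.group("trainer"),
        "loss": float(match.group("loss")),
    }


if __name__ == "__main__":
    sample = ("[ Training #       34400 ][ Episode #     341 ]"
              "[ Loss for type: Policy ][ Trainer for type:  env1 ]"
              "[ Loss: 1.20120335. ]")
    print(parse_log_line(sample))
```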

env1, [-9.83377456665039, -9.83377456665039, -9.83377456665039]

These numbers quantify the noise of the three selection neurons for the environment objective (or decoder agent) env1. Once one of these values increases above 0, the value of the associated latent neuron cannot be recovered by a decoder.
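To check which latent neurons have crossed this threshold, the log can be read back with a few lines of Python. This is a sketch that assumes every line of results_log/selection.txt has the form shown above (a name, a comma, then a bracketed list of numbers); the function names are illustrative.

```python
import ast


def parse_selection_line(line):
    """Split 'env1, [a, b, c]' into the name and a list of floats."""
    name, _, values = line.partition(",")
    return name.strip(), [float(v) for v in ast.literal_eval(values.strip())]


def noisy_neurons(values, threshold=0.0):
    """Indices of selection neurons whose logged value exceeds the threshold."""
    return [index for index, value in enumerate(values) if value > threshold]


if __name__ == "__main__":
    sample = "env1, [-9.83377456665039, -9.83377456665039, -9.83377456665039]"
    name, values = parse_selection_line(sample)
    print(name, noisy_neurons(values))
```

For the sample line, no value is above 0, so the returned list is empty: all three latent neurons are still available to the env1 decoder.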

Example

In the paper, we demonstrate representation learning for a sub-grid world environment. Here, we describe how you may reproduce those results. The code for the environment can be found here.

As described above, we split the training into two modes.

Policy mode

In this training mode, we train the DPS agents on the RL environment. For the results, we first trained three DPS agents to solve the sub-grid world environment. The parameters can be found in the config.py file. To reproduce the results, you need to switch TRAIN_MODE from selection to policy, set LOAD_PS to False for all agents, and run main.py for 3M episodes. For your convenience, we already provide the pretrained agents in the load_models folder. When training these agents, they performed as follows:

[Figure: Results RL]

Selection mode

In this training mode, we train decoders to predict the behavior of DPS agents or to reproduce the input. This is the current setting of the code and can be run immediately with bash _run.sh or python main.py in the main directory. That is, if you run the code as is, you will start training one autoencoder and three decoders to predict the behavior of the pretrained DPS agents. At the same time, selection neurons will pressure the encoder to create a representation of the observation that can be shared among decoders efficiently. In obtaining the results in the paper, we observed the following loss:

[Figure: Results Loss]