# Deep Reinforcement Learning Networks

A list of deep neural network architectures for reinforcement learning tasks.

| Paper | Domain | Model | Architecture | Source code |
| --- | --- | --- | --- | --- |
| Mnih et al., 2013 | Atari | DQN (NIPS version) | The first conv layer: 16 filters of 8×8 with stride 4. The second layer: 32 filters of 4×4 with stride 2. The final hidden layer is fully connected (fc) and consists of 256 units. All hidden layers were followed by ReLU. | |
| Mnih et al., 2015 | Atari | DQN (Nature version) | The first conv layer: 32 filters of 8×8 with stride 4. The second layer: 64 filters of 4×4 with stride 2. The third layer: 64 filters of 3×3 with stride 1. The final hidden layer is fc and consists of 512 units. All hidden layers were followed by ReLU. (See the DQN sketch below the table.) | Torch, deepmind |
| Mnih et al., 2016 | Atari, MuJoCo, Labyrinth, TORCS | A3C (Asynchronous Advantage Actor-Critic) | Atari: the agents used the network architecture from Mnih et al. (2013), as well as a recurrent agent with an additional 256 LSTM cells after the final hidden layer. MuJoCo: in the low-dimensional physical-state case, the inputs are mapped to a hidden state using 1 hidden layer with 200 ReLU units; in the pixels case, the input was passed through 2 conv layers without any non-linearity or pooling. In either case, the output of the encoder layers was fed to a single layer of 128 LSTM cells. Labyrinth: an A3C LSTM agent was trained on this task using only 84×84 RGB images as input. | |
| Hausknecht and Stone, 2015 | Atari | DRQN (Recurrent DQN) | For input, the recurrent network takes a single 84×84 preprocessed image. The convolutional outputs of DQN are fed to an LSTM layer with 512 cells. (See the DRQN sketch below the table.) | Caffe, mhauskn |
| Sorokin et al., 2015 | Atari | DARQN (Attention Recurrent DQN) | The input is an 84×84×1 tensor, and the output of its last (third) conv layer contains 256 feature maps of 7×7. The attention network takes 49 vectors as input, each with a dimension of 256. The number of hidden units in the attention network is chosen to be 256, and the LSTM network also has 256 units, consistent with the number of attention network outputs. | Torch, 5vision |
| Lillicrap et al., 2016 | MuJoCo, TORCS | DDPG (Deep Deterministic Policy Gradient) | The low-dimensional networks had 2 hidden layers with 400 and 300 units respectively (≈ 130,000 parameters); actions were not included until the 2nd hidden layer of Q. In the pixels case: 3 conv layers (no pooling) with 32 filters at each layer, followed by two fc layers with 200 units (≈ 430,000 parameters). In the low-dimensional case, batch normalization is used on the state input and on all layers of the μ network and of the Q network prior to the action input. The final output layer of the actor was a tanh layer, to bound the actions. All hidden layers were followed by ReLU. (See the DDPG sketch below the table.) | Torch, iassael |
| Gu et al., 2016 | MuJoCo | NAF (Normalized Advantage Functions) | For both this method and the prior DDPG algorithm (Lillicrap et al., 2016) in the comparisons, the networks have 2 layers of 200 ReLU units to produce each of the output parameters: the Q-function and policy in DDPG, and the value function V, the advantage matrix L, and the mean μ for NAF. (See the NAF sketch below the table.) | |
| Schulman et al., 2015 | MuJoCo, Atari | TRPO (Trust Region Policy Optimization) | Locomotion tasks: 30 (Swimmer), 50 (Hopper), and 50 (Walker) hidden units. Atari: two conv layers with 16 channels and stride 2, followed by one fc layer with 20 units, yielding 33,500 parameters. | Theano, joschu |
| Duan et al., 2016 | Box2D, MuJoCo | Benchmarking: REINFORCE, TNPG, RWR, REPS, TRPO, CEM, CMA-ES, DDPG | For basic, locomotion, and hierarchical tasks and for batch algorithms, the policy network has 3 hidden layers of 100, 50, and 25 hidden units, with tanh nonlinearities at the first two hidden layers, which map each state to the mean of a Gaussian distribution. For all partially observable tasks, an LSTM with 32 hidden units is used. (See the Gaussian MLP policy sketch below the table.) | Theano, rllab |
| Mohamed and Rezende, 2015 | Room environment (lava-filled maze, key/predator scenarios) | Stochastic variational information maximisation | The first conv layer: 10 filters of 4×4 with stride 1; the second: 10 filters of 3×3 with stride 2. The output of the convolution is passed through an fc layer with 100 hidden units. All hidden layers were followed by ReLU. | |
| Blundell et al., 2016 | Atari, Labyrinth | Model-Free Episodic Control | In all experiments the encoder has four conv layers using {32, 32, 64, 64} kernels respectively, kernel sizes {4, 5, 5, 4}, kernel strides {2, 2, 2, 2}, no padding, and ReLU non-linearity. The conv layers are followed by an fc layer of 512 ReLU units, from which a linear layer outputs the means and log-standard-deviations of the approximate posterior q(z\|x), where z is a 32-dimensional vector and x is 7056-dimensional (84×84). The decoder is set up to mirror the encoder. | |
| Houthooft et al., 2016 | MuJoCo | VIME (Variational Information Maximizing Exploration) | Bayesian NN: for the classic tasks it has one hidden layer of 32 units; for the locomotion tasks it has two hidden layers of 64 units each. All hidden layers were followed by ReLU. NN policy: the classic tasks use a network with one layer of 32 tanh units, while the locomotion tasks use a two-layer network of 64 and 32 tanh units. The classic tasks use a baseline network with one layer of 32 ReLU units, while the locomotion tasks use a linear baseline function. | Theano, openai |
| Ho and Ermon, 2016 | MuJoCo | Generative Adversarial Imitation Learning (GAIL) | The same neural network architecture is used for all tasks: two hidden layers of 100 units each, with tanh nonlinearities in between. | Theano, openai |
| Levine et al., 2015 | PR2 robot | Visuomotor Policy | The images were downsampled to 240×240×3. The network contains 3 conv layers (one with 64 filters of 7×7 with stride 2 and two layers with 32 filters of 5×5), followed by a spatial softmax and an expected-position layer that converts pixel-wise features to 64 feature points. The points are concatenated with 39 robot configuration values, then passed through 3 fc layers (40, 40 and 7 units) to produce the torques. The network has 7 layers and around 92,000 parameters. (See the spatial-softmax sketch below the table.) | |
| Watter et al., 2015 | Visual versions of the classic control tasks | Embed to Control (E2C) | Plane: encoder 150 - 150 - 150 - 4 linear (2 for AE); decoder 200 - 200 - 1600 linear (sigmoid for AE); dynamics 100 - 100 + output layer. Pendulum swing-up: encoder 800 - 800 - 6 linear (3 for AE); decoder 800 - 800 - 4608 linear (sigmoid for AE); dynamics 100 - 100 + output layer. Cart-pole balancing: encoder 32×5×5 - 32×5×5 - 32×5×5 - 512 - 512; decoder 512 - 512 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 32×5×5; dynamics 200 - 200 + 32 linear. Three-link arm: encoder 64×5×5 - 2×2 max-pooling - 32×5×5 - 2×2 max-pooling - 32×5×5 - 2×2 max-pooling - 512 - 512; decoder 512 - 512 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 64×5×5; dynamics 200 - 200 + 48 linear. All hidden layers were followed by ReLU. | |
| Assael et al., 2015 | Pendulum (pixels-to-torques) | DDM (Deep Dynamical Models) | Planar pendulum: the 40×40 = 1600-pixel screenshots are reduced to 100 dimensions using PCA; f_enc: 100×50 – 50×50 – 50×2, f_pred: 5×100 – 100×100 – 100×2, f_dec: 2×50 – 50×50 – 50×100. Planar double pendulum: the 48×48 = 2304-pixel screenshots are reduced to 512 dimensions using PCA; f_enc and f_dec: 512×256 – 256×256 – 256×4, f_pred: 10×200 – 200×200 – 200×4. All hidden layers were followed by ReLU. | |
| Mordatch et al., 2015 | MuJoCo | Interactive control policy | All experiments use neural networks with 3 hidden layers of 250 units each and tanh activation functions. | |
| Peng et al., 2016 | BulletPhysics | MACE (Mixture of Actor-Critic Experts) | The first conv layer: 16 filters of 8×1. The second layer: 32 filters of 4×1. The third layer: 32 filters of 4×1. A stride of 1 is used for all conv layers. The output of the final conv layer is processed by 64 fc units, and the resulting features are then concatenated with the character features. The combined features are processed by an fc layer of 256 units. The network then branches into critic and actor subnetworks, each with an fc layer of 128 units followed by a linear output layer. The sizes of the output layers vary by subnetwork, ranging from 3 output units for the critics to 29 units for each actor. The combined network has approximately 570k parameters. All hidden layers were followed by ReLU. | Caffe, xbpeng |
| Parisotto et al., 2015 | Atari | Actor-Mimic | The network used for transfer consisted of the following architecture: 8×8×4×256-4 → 4×4×256×512-2 → 3×3×512×512-1 → 3×3×512×512-1 → 2048 fc units → 1024 fc units → 18 actions. All hidden layers were followed by ReLU. (See the Actor-Mimic sketch below the table.) | Torch, eparisotto |
| Rusu et al., 2016; Raia Hadsell slides | Atari, Labyrinth, MuJoCo, Jaco arm | Progressive nets | Atari: a model with 3 conv layers followed by an fc layer from which the policy and value function are predicted. The conv layers have 12 feature maps. The first layer has a kernel of size 8×8 and a stride of 4×4. The second layer has a kernel of size 4 and a stride of 2. The third layer has a kernel of size 3×4 with a stride of 1. The fc layer has 256 hidden units. | |
| Oh et al., 2015 | Atari | Action-Conditional Video Prediction | The encoding layers consist of 4 conv layers and 1 fc layer with 2048 hidden units. The conv layers use 64 (8×8), 128 (6×6), 128 (6×6), and 128 (4×4) filters with a stride of 2. Every layer is followed by ReLU. In the recurrent encoding network, an LSTM layer with 2048 hidden units is added on top of the fc layer. The number of factors in the transformation layer is 2048. The decoding layers consist of one fc layer with 11264 (= 128×11×8) hidden units followed by 4 deconv layers with 128 (4×4), 128 (6×6), 128 (6×6), and 3 (8×8) filters with a stride of 2. | Caffe, junhyukoh |
| Stadie et al., 2015 | Atari | Incentivizing exploration | The autoencoder has 8 hidden layers (1000-500-250-128-250-500-1000-7056 units), followed by a Euclidean loss layer. | |
| Sukhbaatar et al., 2015 | MazeBase, StarCraft | Memory network | ConvNet: 4 conv layers (the first layer has a 1×1 kernel, which essentially makes it an embedding of words). Items without spatial location (e.g. "Info" items) are each represented as a bag of words and then combined via an fc layer with the outputs of the conv layers; these are then passed through 2 fc layers to output the actions (and a baseline for reinforcement). MemNN: the architecture from Sukhbaatar et al. (2015) is used with 3 hops and tanh non-linearities. | Torch, facebook |
| Kulkarni et al., 2016 | MazeBase, ViZDoom | DSR (Deep Successor Representation) | The feature branch is a four-layer CNN: 32 (8×8), 64 (4×4), 64 (3×3) filters and 512 units, with an additional fifth layer of 512 tanh units (equal to the SR dimension). Intrinsic reward decoder: 512 (4×4), 256 (4×4), 128 (4×4), 64 (4×4), 3 (4×4). Successor branch: 512, 256, 512. All hidden layers were followed by ReLU. | Torch, Ardavans |
| Sunehag et al., 2015 | Recommendation system | Slate-MDP (high-dimensional control) | For all agents' Q-functions, the neural networks have 2 hidden layers of 100 units each. The policies are feed-forward neural networks with 2 hidden layers of 25 units each. | |
...
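
The sketches below are illustrative only: they are written in PyTorch rather than the Torch7/Caffe/Theano implementations linked in the table, and all class, variable, and dimension names are my own. First, the DQN (Nature version) row, assuming the standard 4-frame stack of 84×84 inputs.

```python
# Illustrative PyTorch sketch of the DQN (Nature version) architecture
# from the Mnih et al., 2015 row. Not the original Torch7 implementation.
import torch
import torch.nn as nn


class NatureDQN(nn.Module):
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        # Three conv layers: 32 filters of 8x8 stride 4, 64 of 4x4 stride 2,
        # 64 of 3x3 stride 1; each followed by ReLU.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # For an 84x84 input the conv stack yields 64 feature maps of 7x7.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # final fc hidden layer
            nn.Linear(512, n_actions),              # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))


# Example: a batch with one 4-frame 84x84 grayscale stack.
q_values = NatureDQN(n_actions=6)(torch.zeros(1, 4, 84, 84))
print(q_values.shape)  # torch.Size([1, 6])
```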
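A sketch of DRQN from the Hausknecht and Stone, 2015 row. The row does not spell out the conv stack, so this assumes the Nature-DQN conv layers in front of the 512-cell LSTM; names are my own and the original code was written in Caffe.

```python
# Illustrative DRQN sketch: DQN-style convs on a single 84x84 frame per
# time step, with the fc hidden layer replaced by an LSTM with 512 cells.
import torch
import torch.nn as nn


class DRQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 7 * 7, hidden_size=512, batch_first=True)
        self.q = nn.Linear(512, n_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- one frame per step, no stacking.
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:]))
        out, hidden = self.lstm(feats.reshape(b, t, -1), hidden)
        return self.q(out), hidden  # Q-values for every step, plus LSTM state


q, h = DRQN(n_actions=18)(torch.zeros(2, 10, 1, 84, 84))
print(q.shape)  # torch.Size([2, 10, 18])
```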
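A sketch of the low-dimensional DDPG networks from the Lillicrap et al., 2016 row: hidden layers of 400 and 300 ReLU units, a tanh actor output, and the action entering the critic only at its second hidden layer. The batch normalization mentioned in the row is omitted here for brevity.

```python
# Illustrative DDPG actor/critic sketch for low-dimensional observations.
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        # The action is concatenated in only at the second hidden layer of Q.
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, state, action):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(torch.cat([h, action], dim=-1)))
        return self.out(h)


mu, q = Actor(17, 6), Critic(17, 6)
s, a = torch.zeros(1, 17), torch.zeros(1, 6)
print(mu(s).shape, q(s, a).shape)  # torch.Size([1, 6]) torch.Size([1, 1])
```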
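A sketch of the NAF heads from the Gu et al., 2016 row: a trunk of 2 layers of 200 ReLU units producing the value V, the mean μ, and the entries of a lower-triangular matrix L that parameterises the quadratic advantage term. The lower-triangular bookkeeping here is my own reading of that construction.

```python
# Illustrative NAF sketch: Q(s, a) = V(s) - 0.5 * (a - mu)^T L L^T (a - mu).
import torch
import torch.nn as nn


class NAF(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.action_dim = action_dim
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
        )
        self.value = nn.Linear(200, 1)
        self.mu = nn.Linear(200, action_dim)
        self.l_entries = nn.Linear(200, action_dim * (action_dim + 1) // 2)

    def forward(self, state, action):
        h = self.trunk(state)
        v, mu = self.value(h), self.mu(h)
        # Build lower-triangular L with a positive (exponentiated) diagonal.
        tril = torch.zeros(state.shape[0], self.action_dim, self.action_dim)
        rows, cols = torch.tril_indices(self.action_dim, self.action_dim)
        tril[:, rows, cols] = self.l_entries(h)
        diag = torch.arange(self.action_dim)
        tril[:, diag, diag] = tril[:, diag, diag].exp()
        p = tril @ tril.transpose(1, 2)                       # P = L L^T
        d = (action - mu).unsqueeze(-1)
        advantage = -0.5 * (d.transpose(1, 2) @ p @ d).squeeze(-1)
        return v + advantage                                  # Q(s, a)


q = NAF(state_dim=11, action_dim=3)(torch.zeros(4, 11), torch.zeros(4, 3))
print(q.shape)  # torch.Size([4, 1])
```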
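A sketch of the Gaussian MLP policy from the Duan et al., 2016 benchmarking row: hidden layers of 100, 50, and 25 units with tanh after the first two, mapping the state to the mean of a Gaussian. The state-independent learnable log-std is an assumption of this sketch, not stated in the row.

```python
# Illustrative Gaussian MLP policy sketch for the rllab benchmark tasks.
import torch
import torch.nn as nn


class GaussianMLPPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, 100), nn.Tanh(),
            nn.Linear(100, 50), nn.Tanh(),
            nn.Linear(50, 25),
            nn.Linear(25, action_dim),   # mean of the action distribution
        )
        # Assumed: one learnable log-std per action dimension.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        return torch.distributions.Normal(self.mean(state), self.log_std.exp())


dist = GaussianMLPPolicy(state_dim=8, action_dim=2)(torch.zeros(1, 8))
print(dist.sample().shape)  # torch.Size([1, 2])
```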
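A sketch of the spatial softmax and expected-position layer from the Levine et al., 2015 row. This is a generic sketch of the operation, not the authors' implementation: each conv feature map becomes a softmax distribution over pixel locations, and the expected (x, y) position is the feature, giving 2 values per map.

```python
# Illustrative spatial-softmax sketch: conv maps -> expected (x, y) points.
import torch
import torch.nn as nn


class SpatialSoftmax(nn.Module):
    def forward(self, feature_maps: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feature_maps.shape
        # Softmax over all spatial locations of each feature map.
        probs = torch.softmax(feature_maps.reshape(b, c, h * w), dim=-1)
        probs = probs.reshape(b, c, h, w)
        # Normalised pixel-coordinate grids in [-1, 1].
        ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1)
        xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w)
        expected_x = (probs * xs).sum(dim=(2, 3))
        expected_y = (probs * ys).sum(dim=(2, 3))
        # One (x, y) feature point per channel: 2 * channels values in total.
        return torch.cat([expected_x, expected_y], dim=1)


# E.g. 32 feature maps from the last conv layer -> 64 feature-point values.
points = SpatialSoftmax()(torch.randn(1, 32, 60, 60))
print(points.shape)  # torch.Size([1, 64])
```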
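Finally, a possible reading of the arrow notation in the Parisotto et al., 2015 row, interpreting "8×8×4×256-4" as an 8×8 kernel, 4 input channels, 256 filters, stride 4. This interpretation and the flattened dimension for an 84×84 input are my own assumptions; the original implementation was written in Torch7.

```python
# Illustrative sketch of the Actor-Mimic transfer architecture.
import torch
import torch.nn as nn

actor_mimic_transfer_net = nn.Sequential(
    nn.Conv2d(4, 256, kernel_size=8, stride=4), nn.ReLU(),    # 8x8x4x256-4
    nn.Conv2d(256, 512, kernel_size=4, stride=2), nn.ReLU(),  # 4x4x256x512-2
    nn.Conv2d(512, 512, kernel_size=3, stride=1), nn.ReLU(),  # 3x3x512x512-1
    nn.Conv2d(512, 512, kernel_size=3, stride=1), nn.ReLU(),  # 3x3x512x512-1
    nn.Flatten(),
    nn.Linear(512 * 5 * 5, 2048), nn.ReLU(),                  # 2048 fc units
    nn.Linear(2048, 1024), nn.ReLU(),                         # 1024 fc units
    nn.Linear(1024, 18),                                      # 18 actions
)

out = actor_mimic_transfer_net(torch.zeros(1, 4, 84, 84))
print(out.shape)  # torch.Size([1, 18])
```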

See also: Deep Reinforcement Learning Papers