inarikami/keras-rl2

Agent fails with sequential data

mmansky-3 opened this issue · 0 comments

The Agent implementation fails for data of indeterminate length, such as temporal data. An environment that outputs observations of shape (None, data_dim) fails when the accompanying model has a matching LSTM as its first layer.

It appears that either the Agent or the standard Processor adds another dimension to the observation, causing a shape mismatch between the Environment output and the Model input and raising a ValueError. For an Environment that outputs shape (None, 10), the error is:

ValueError: Input 0 of layer lstm is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [1, 1, None, 10]

The first "1" refers to the batch dimension and is to be expected. As an immediate workaround, one can add a squeeze layer to the model, something along the lines of Input>Squeeze>LSTM>Output.

  • Check that you are up-to-date with the master branch of Keras-RL. You can update with:
    pip install git+git://github.com/wau/keras-rl2.git --upgrade --no-deps

  • Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short). If you report an error, please include the error message and the backtrace.

Example Code:

import rl.memory
import rl.agents
import rl.core
import tensorflow as tf
import numpy as np

BATCH_SIZE = 1
DATA_DIM = 10

class Environment(rl.core.Env):
	def __init__(self, data_dim = 10, game_length = 50):
		self.reward_counter = 0
		self.data_dim = data_dim
		self.game_length = game_length
		self.reward = 0.1
		self.observation = [[0] * self.data_dim]
		self.observation[0][0] = 1
		self.done = False

	def step(self, action):
		action_number = np.argmax(action)
		reward = 0.0  # default reward when no scoring event occurs this step
		if not (self.reward_counter + action_number % self.data_dim) or np.random.rand() < 0.05:
			self.reward *= 1.1
			self.observation.append([0] * self.data_dim)
			self.observation[-1][self.reward_counter % self.data_dim] = 1
			self.reward_counter += 1
			reward = self.reward
		observation = np.array(self.observation)  # shape (timesteps, data_dim), grows each step
		if len(self.observation) > self.game_length and np.random.rand() < 0.05:
			self.done = True
		done = self.done
		info = {}
		return observation, reward, done, info

	def reset(self):
		self.done = False
		self.reward_counter = 0
		self.reward = 0.1
		self.observation = [[0] * self.data_dim]
		self.observation[0][0] = 1
		observation = self.observation
		observation = np.array(observation)
		return observation

	def close(self):
		pass  # nothing to clean up

if __name__ == '__main__':
	lstm_input = tf.keras.Input(batch_shape = (BATCH_SIZE, 1, None, DATA_DIM))
	# lstm_input = tf.keras.backend.squeeze(lstm_input, 1) # uncomment squeeze layer to fix model.
	x = tf.keras.layers.LSTM(20)(lstm_input)
	x = tf.keras.layers.Dense(10, activation='softmax')(x) # output size must match nb_actions below
	model = tf.keras.Model(inputs = [lstm_input], outputs = [x])

	memory = rl.memory.SequentialMemory(50000, window_length=BATCH_SIZE)
	processor = rl.core.Processor()

	agent = rl.agents.DQNAgent(model, memory=memory, processor=processor, nb_actions=10, batch_size=BATCH_SIZE)
	agent.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01))

	env = Environment(data_dim=DATA_DIM)

	agent.fit(env, nb_steps=int(5e5), log_interval=1000)
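
For completeness, a minimal sketch of the Input > Squeeze > LSTM > Output workaround mentioned above, assuming TF 2.x Keras (a Lambda layer drops the window dimension so the LSTM sees (batch, timesteps, features); whether the agent then handles the variable-length observations correctly is exactly what this issue is about):

import tensorflow as tf

DATA_DIM = 10

# hypothetical fixed model: squeeze out the window dimension before the LSTM
inputs = tf.keras.Input(batch_shape=(1, 1, None, DATA_DIM))                  # (batch, window, timesteps, features)
squeezed = tf.keras.layers.Lambda(lambda t: tf.squeeze(t, axis=1))(inputs)   # -> (batch, timesteps, features)
x = tf.keras.layers.LSTM(20)(squeezed)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)                 # must match nb_actions for DQNAgent
model = tf.keras.Model(inputs=inputs, outputs=outputs)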