Abstract
The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.
Introduction
Most successful RL applications have relied on hand-crafted features combined with linear value functions or policy representations. Recent advances in deep learning have made it possible to extract high-level features from raw sensory data.
Reinforcement learning poses several challenges from a deep learning perspective:
- Most deep learning methods require large amounts of hand-labelled training data, whereas RL must learn from a scalar reward signal that is sparse, noisy, and delayed.
- The delay between actions and the resulting rewards can be thousands of time-steps long.
- RL encounters sequences of highly correlated states, and the data distribution changes as the algorithm learns new behaviours, whereas deep learning methods assume a fixed underlying distribution.
To alleviate the problems of correlated data and non-stationary distributions, the paper uses an experience replay mechanism that randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.
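A minimal sketch of such a replay memory, assuming a simple transition-tuple layout and a uniform-sampling buffer (the capacity, field names, and use of Python's `random.sample` are illustrative choices, not details from the paper):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of past transitions (state, action, reward, next_state, done)."""

    def __init__(self, capacity=100_000):  # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Every interaction with the emulator is stored as one transition;
        # the oldest transitions are discarded once the buffer is full.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniformly sampling old transitions breaks the correlation between
        # consecutive frames and smooths the training distribution.
        return random.sample(self.buffer, batch_size)
```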
Background
Define the environment $\mathcal{E}$ as the Atari emulator.
At each time-step, the agent selects an action $a_t$ from the set of legal game actions, $\mathcal{A} = \{1, \ldots, K\}$.
The action is passed to the emulator and modifies its internal state and the game score.
The agent observes an image $x_t$ from the emulator, a vector of raw pixel values representing the current screen, and receives a reward $r_t$ representing the change in game score; it does not observe the emulator's internal state.
Since it is impossible to fully understand the current situation
from only the current screen, the authors consider sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and learn game strategies that depend upon these sequences. As a result, they can apply standard reinforcement learning methods for MDPs (Markov Decision Processes), treating each sequence as a distinct state.
The goal of the agent: Maximize future rewards.
Make assumption: future rewards are discounted by a factor of $\gamma$ per time-step.
Define the future discounted return at time $t$ as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the time-step at which the game terminates.
Define the optimal action-value function as $Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $\pi$ is a policy mapping sequences to actions (or to distributions over actions).
The basic idea behind many RL algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}_{s'}[r + \gamma \max_{a'} Q_i(s', a') \mid s, a]$; such value-iteration algorithms converge to $Q^*$ as $i \to \infty$.
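As a rough sketch of what that update looks like in the tabular case (assuming a small finite state space with a known transition model, which the Atari setting does not have):

```python
import numpy as np

def q_value_iteration(n_states, n_actions, transitions, gamma=0.99, n_iters=100):
    """Tabular value iteration on Q: Q_{i+1}(s, a) = E[r + gamma * max_a' Q_i(s', a')].

    `transitions[s][a]` is assumed to be a list of (prob, next_state, reward)
    tuples; this explicit model is an illustrative assumption.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q_next = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                Q_next[s, a] = sum(p * (r + gamma * Q[s2].max())
                                   for p, s2, r in transitions[s][a])
        Q = Q_next
    return Q
```

Keeping one table entry per (sequence, action) pair is exactly what makes this approach infeasible for high-dimensional sequences, which motivates the function approximator below.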
However, this approach is impractical here, because the action-value function is estimated separately (independently) for each sequence, with no generalisation; estimating a value for every high-dimensional sequence is prohibitive in both time and memory. Instead, it is common to use a function approximator, $Q(s, a; \theta) \approx Q^*(s, a)$, which captures the regularities across sequences through a set of parameters.
They refer to a neural network function approximator with weights $\theta$ as a Q-network. A Q-network can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,
$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\big[(y_i - Q(s, a; \theta_i))^2\big],$$
where $y_i = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\big]$ is the target for iteration $i$ and $\rho(s, a)$ is the behaviour distribution over sequences and actions.
The parameters $\theta_{i-1}$ from the previous iteration are held fixed when optimising $L_i(\theta_i)$. (Because the target depends sensitively on $\theta$, freezing the previous parameters keeps learning stable.) Differentiating the loss function with respect to the weights, we arrive at the following gradient:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s'}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\Big].$$
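A hedged sketch of a single stochastic-gradient step on this loss, written in PyTorch (the names `q_net` and `target_net` and the batch layout are assumptions; `target_net` simply holds a frozen copy of the previous parameters $\theta_{i-1}$):

```python
import torch
import torch.nn.functional as F

def q_learning_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One SGD step on L_i(theta_i) = E[(y_i - Q(s, a; theta_i))^2]."""
    states, actions, rewards, next_states, dones = batch  # actions: LongTensor of shape [B]

    # Q(s, a; theta_i) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # The target y_i uses the frozen parameters theta_{i-1}, so no gradient
    # flows through it -- this is what "held fixed" means above.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Differentiating the squared error yields the gradient above; in practice the expectation is replaced by a single minibatch sample, as in stochastic gradient descent.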
Deep Reinforcement Learning
- Experience replay
- After performing experience replay, the agent selects and executes an action according to an $\epsilon$-greedy policy.
- Since it is difficult to feed histories of arbitrary length into a neural network, the Q-function instead works on a fixed-length representation of histories produced by a function $\phi$.
The full algorithm, which we call deep Q-learning, is presented in Algorithm 1.
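A condensed sketch of that loop, building on the `ReplayMemory` and `q_learning_update` sketches above (the emulator interface `env.reset` / `env.step` / `env.sample_action`, the helper `to_tensors`, and the hyper-parameter values are hypothetical; `phi` is assumed to return a stacked-frame tensor as described in the pre-processing section):

```python
import random
import torch

def deep_q_learning(env, q_net, target_net, optimizer, phi,
                    n_episodes=100, epsilon=0.1, batch_size=32, gamma=0.99):
    """Skeleton of deep Q-learning: act epsilon-greedily, store each transition
    in replay memory, then train on a random minibatch of stored transitions."""
    memory = ReplayMemory()
    for _ in range(n_episodes):
        frames = [env.reset()]                           # hypothetical emulator API
        state, done = phi(frames), False
        while not done:
            # Behaviour policy: epsilon-greedy over the current Q-network.
            if random.random() < epsilon:
                action = env.sample_action()             # hypothetical emulator API
            else:
                with torch.no_grad():
                    action = int(q_net(state.unsqueeze(0)).argmax(dim=1))
            frame, reward, done = env.step(action)       # hypothetical emulator API
            frames.append(frame)
            next_state = phi(frames)
            memory.store(state, action, reward, next_state, done)
            state = next_state
            if len(memory.buffer) >= batch_size:
                batch = to_tensors(memory.sample(batch_size))  # hypothetical helper
                q_learning_update(q_net, target_net, optimizer, batch, gamma)
```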
- Advantages:
  - Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
  - Learning directly from consecutive samples is inefficient because the samples are strongly correlated; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
  - When learning on-policy, the current parameters determine the next data sample that the parameters are trained on, which can lead to feedback loops and divergence; with experience replay the behavior distribution is averaged over many previous states, smoothing out learning.
- The algorithm is model-free and off-policy
- model-free: the agent learns by trial and error, directly from samples of the emulator $\mathcal{E}$, without explicitly constructing a model of it (as planning / model-based methods do).
- off-policy: the behaviour policy used to select actions ($\epsilon$-greedy, which ensures adequate exploration) is different from the target policy being learned (the greedy policy that maximises $Q(s, a; \theta)$); on-policy methods use the same policy for both roles.
Pre-processing and Model Architecture
- Raw Atari frames are 210 x 160 pixel images with a 128-colour palette.
- Frames are converted to gray-scale and down-sampled to a 110 x 84 image.
- The image is then cropped to an 84 x 84 region that roughly captures the playing area.
- The function $\phi$ applies this pre-processing to the last 4 frames of a history and stacks them to produce the 4 x 84 x 84 input to the Q-function.
- The architecture has a separate output unit for each possible action, and only the state representation is an input to the neural network, so the Q-values of all actions in a given state are computed with a single forward pass (see the sketch after this list).
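A sketch of a network with this input/output structure, following the layer sizes reported in the paper (16 8x8 filters with stride 4, then 32 4x4 filters with stride 2, then a fully connected layer of 256 rectifier units); the PyTorch framing and class name are my own:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 pre-processed 84x84 frames to one Q-value per action."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one output unit per action
        )

    def forward(self, x):
        # A single forward pass yields Q-values for every legal action, so the
        # greedy action is found with one evaluation of the network.
        return self.head(self.features(x))
```

Putting actions on the output side, rather than feeding an action in as part of the input, is what makes this single-pass evaluation possible.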
Experiments
- Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged.
- used the RMSProp algorithm with minibatches of size 32.
- Use a simple frame-skipping technique: the agent sees and selects actions on every $k$-th frame instead of every frame, and its last action is repeated on the skipped frames ($k = 4$ for all games except Space Invaders, where $k = 3$). A sketch combining this with the reward clipping above follows.
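A sketch of how the frame-skipping and reward clipping could be wrapped around an emulator step (the `env.step` interface returning `(frame, reward, done)` is a hypothetical convention, as in the earlier training-loop sketch):

```python
import numpy as np

class SkipAndClipWrapper:
    """Repeats the chosen action for k emulator frames and clips rewards to -1, 0, or +1."""

    def __init__(self, env, k=4):
        self.env, self.k = env, k

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.k):
            frame, reward, done = self.env.step(action)  # last action repeated on skipped frames
            total_reward += reward
            if done:
                break
        # Fix the scale of scores across games: positive -> +1, negative -> -1, zero unchanged.
        return frame, float(np.sign(total_reward)), done
```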