[107-2] Applied Deep Learning - Deep Reinforcement Learning on Atari Games

In this project, we implement several agents that play Atari games, including Policy Gradient, Deep Q-Learning (DQN), and Advantage-Actor-Critic (A2C).

Usage

Clone the repository and install the required packages

git clone https://github.com/hsiehjackson/
pip install -r requirements

Download the files and extract the zip archive

bash download.sh
unzip download.zip

Policy Gradient

Training

python main.py --train_pg --pg_type=[pg|pg_nor|pg_ppo] --folder_name=[YOUR NAME]

Testing (your/my best)

python main.py --test_pg --model_path=[YOUR MODEL PATH]
python main.py --test_pg --model_path=./model/best/pg.cpt

Deep Q-Learning

Training

python main.py --train_dqn --dqn_type=[DQN|DoubleDQN|DuelDQN|DDDQN] --folder_name=[YOUR NAME]

Testing (your/my best)

python main.py --test_dqn --dqn_type=[DQN|DoubleDQN|DuelDQN|DDDQN] --model_path=[YOUR MODEL PATH]
python main.py --test_dqn --dqn_type=DDDQN --model_path=./model/best/dqn.cpt

Advantage-Actor-Critic

Training

python main.py --train_a2c --folder_name=[YOUR NAME]

Testing (your/my best)

python main.py --test_a2c --model_path=[YOUR MODEL PATH]
python main.py --test_a2c --model_path=./model/best/a2c.cpt

Others

Plot training progress

python plot.py ./model/[folder_name]/plot.json

Environment Introduction

We use three Atari games, one per method, to test performance: LunarLander, Assault, and Mario. You can click the hyperlinks to see the game rules. The following GIFs show my best results.

LunarLander	Assault	Mario

Techniques for Deep Reinforcement Learning

Policy Gradient

We implement policy gradient agents with the REINFORCE algorithm. I also add two improvements: reward normalization and proximal policy optimization (PPO).

Reward Normalization

Because all rewards are positive, we subtract a baseline (by normalizing the rewards) so that some of them become negative. With a baseline, the probability of actions that were not sampled does not decrease sharply.
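To make the baseline idea concrete, here is a minimal sketch in plain Python. It assumes that normalization means subtracting the mean of the discounted returns of one episode and dividing by their standard deviation, which is a common choice but is not confirmed by the text. The function names and the value of `gamma` are illustrative only and are not taken from the repository.

```python
import statistics


def discounted_returns(rewards, gamma=0.99):
    """Return the discounted return G_t for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize(returns, eps=1e-8):
    """Subtract the mean (the baseline) and divide by the standard deviation."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    # eps avoids a division by zero when all returns are identical.
    return [(g - mean) / (std + eps) for g in returns]


# Hypothetical episode in which every reward is positive.
rewards = [1.0, 1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
weights = normalize(returns)  # now a mix of positive and negative values
```

After normalization the weights sum to roughly zero, so actions with below-average returns are pushed down instead of every sampled action being pushed up.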

Proximal Policy Optimization

PPO makes the update off-policy through importance sampling: samples collected under an old policy θ' are reused to update θ, with a KL-divergence constraint so that θ cannot move far from θ'. The objective function is shown below.
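The objective figure is not reproduced in this text, so the following is only a sketch of one common form: the importance-weighted advantage minus a KL penalty weighted by a coefficient `beta`. Whether the repository uses this penalty form or the clipped variant, and in which direction it takes the KL divergence, is not stated here. All names and the default `beta` are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ppo_kl_objective(old_probs, new_probs, actions, advantages, beta=1.0):
    """Importance-weighted advantage minus a KL penalty, averaged over samples.

    old_probs / new_probs: per-sample action distributions under theta' / theta.
    actions: index of the action that was sampled under theta'.
    """
    total = 0.0
    for old, new, action, adv in zip(old_probs, new_probs, actions, advantages):
        ratio = new[action] / old[action]  # importance weight
        total += ratio * adv - beta * kl_divergence(old, new)
    return total / len(actions)


# Hypothetical batch of two samples with two actions each.
old = [[0.5, 0.5], [0.25, 0.75]]
new = [[0.6, 0.4], [0.25, 0.75]]
objective = ppo_kl_objective(old, new, actions=[0, 1], advantages=[1.0, -2.0])
```

When `new` equals `old`, every ratio is 1 and the penalty is 0, so the objective reduces to the mean advantage. The penalty grows as the two policies diverge.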

Deep Q-Learning (DQN)

Besides the classic DQN algorithm, we also implement improvements to it, such as Double DQN and Dueling DQN.

Double DQN

Double DQN addresses the overestimation of Q-values. It uses two networks (online and target) that compensate for each other: the online network selects the action and the target network evaluates it, which avoids overestimated Q-values. The method only needs one additional copied network when computing the targets.
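The difference between the two targets can be sketched in a few lines of plain Python. The Q-values are passed in as lists, one entry per action, and the function and argument names are illustrative only. In the repository they would presumably come from the online and target networks.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Classic DQN: the target network both picks and scores the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for the next state with three actions.
next_q_online = [1.0, 2.0, 0.5]
next_q_target = [3.0, 1.5, 0.0]
classic = dqn_target(1.0, next_q_target, gamma=0.5)
double = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.5)
```

In this example the target network alone would bootstrap from its largest estimate (3.0), while the Double DQN target uses the target network's value for the action the online network prefers (1.5), giving a smaller target.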

Dueling DQN

Dueling DQN splits the Q-value into a state value shared by all actions and a per-action advantage, and places a normalization constraint on the advantages. Because the state value is shared, the Q-values of actions that were not sampled are also updated, which is more efficient.
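A minimal sketch of the aggregation step follows. It assumes the normalization constraint is the usual mean-centring of the advantages, Q(s, a) = V(s) + A(s, a) - mean(A). The repository may use a different constraint, and the function name is illustrative only.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with mean-centred advantages."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Hypothetical outputs of the two streams for one state with three actions.
q_before = dueling_q_values(2.0, [1.0, 3.0, 5.0])
# Raising only the state value shifts the Q-value of every action at once.
q_after = dueling_q_values(3.0, [1.0, 3.0, 5.0])
```

Because the advantages are centred, the mean of the Q-values equals the state value, so an update to the state value changes all actions, including those that were never sampled.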

Double DQN	Dueling DQN

Advantage-Actor-Critic (A2C)

Beyond the general A2C framework, we also implement proximal policy optimization (PPO) and Generalized Advantage Estimation (GAE) on multi-process environments, with a value loss, an action loss, and an entropy loss. This method accounts for the KL-divergence constraint and allows several training iterations on each batch of collected steps. Our framework was adapted from here.
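The GAE part can be sketched as below for a single rollout. This is a simplified illustration that ignores episode boundaries and multiple processes. The names and the default `gamma` and `lam` values are assumptions, not the repository's settings.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout without episode ends.

    values[t] is the critic's estimate V(s_t); last_value is the estimate for
    the state reached after the final step and is used for bootstrapping.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothetical two-step rollout with a critic that predicts zero everywhere.
full = gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
one_step = gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=0.0)
```

With `lam=1.0` the estimate accumulates all future TD errors, while `lam=0.0` keeps only the one-step TD error. Values in between trade bias against variance.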

Training Procedure Results

Reward normalization clearly gives better results than the baseline. PPO, however, shows little improvement in reward and mainly reduces the variance of training. Perhaps its benefit is not apparent on LunarLander.

Policy Gradient

Several of the improved techniques outperform baseline DQN, but the joint use of DoubleDQN and DuelingDQN does not give the highest performance. These results suggest that DuelingDQN is particularly effective at improving the DQN framework on Assault.

Deep Q-Learning (DQN)

For A2C, performance with PPO is higher than without it, unlike the policy gradient results. Without PPO, performance decays as training continues, which shows how difficult it is to learn a good agent on Mario.

Advantage-Actor-Critic (A2C)

Reward Results

Method	Policy Gradient	Deep Q-Learning	Advantage-Actor-Critic
Games	LunarLander	Assault	Mario
Test Episodes	30	100	10
Average Rewards	90.11	275.95	3243.30

hsiehjackson/Deep-Reinforcement-Learning-on-Atari-Games
