In this project, we implement several agents that play Atari games: Policy Gradient, Deep Q-Learning (DQN), and Advantage Actor-Critic (A2C).
- Clone the repository and install the packages
git clone https://github.com/hsiehjackson/
pip install -r requirements
- Download the files and extract the zip archive
bash download.sh
unzip download.zip
- Training
python main.py --train_pg --pg_type=[pg | pg_nor | pg_ppo] --folder_name=[YOUR NAME]
- Testing (your/my best)
python main.py --test_pg --model_path=[YOUR MODEL PATH]
python main.py --test_pg --model_path=./model/best/pg.cpt
- Training
python main.py --train_dqn --dqn_type=[DQN | DoubleDQN | DuelDQN | DDDQN] --folder_name=[YOUR NAME]
- Testing (your/my best)
python main.py --test_dqn --dqn_type=[DQN | DoubleDQN | DuelDQN | DDDQN] --model_path=[YOUR MODEL PATH]
python main.py --test_dqn --dqn_type=DDDQN --model_path=./model/best/dqn.cpt
- Training
python main.py --train_a2c --folder_name=[YOUR NAME]
- Testing (your/my best)
python main.py --test_a2c --model_path=[YOUR MODEL PATH]
python main.py --test_a2c --model_path=./model/best/a2c.cpt
- Plot training progress
python plot.py ./model/[folder_name]/plot.json
We evaluate the agents separately on three Atari games: LunarLander, Assault, and Mario. Click the hyperlinks to see the game rules. The following GIFs show my best results.
LunarLander | Assault | Mario |
---|---|---|
We implement policy gradient agents with the REINFORCE algorithm. We also add two improvements: reward normalization and proximal policy optimization (PPO).
- Reward Normalization
Because all rewards are positive, we subtract a baseline (normalizing the rewards) so that some of them become negative. With a baseline, the probability of actions that were not sampled does not decrease sharply.
- Proximal policy optimization
PPO makes policy gradient off-policy through importance sampling, and sets a KL-divergence constraint so that θ cannot differ too much from θ'. The objective function is shown below.
Besides the classic DQN algorithm, we also implement two improvements for DQN: Double DQN and Dueling DQN.
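Both improvements above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in this repository: the function names, the discount factor `gamma`, the penalty weight `beta`, and the use of the KL-penalty form of the PPO objective (importance-weighted advantage minus `beta` times the KL divergence between the old and new policies) are all assumptions made for the example.

```python
import math


def normalized_returns(rewards, gamma=0.99, eps=1e-8):
    """Discounted returns, shifted to zero mean and scaled to unit std.

    Hypothetical helper: after normalization some returns are negative
    even when every raw reward is positive.
    """
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    mean = sum(returns) / len(returns)
    var = sum((g - mean) ** 2 for g in returns) / len(returns)
    std = math.sqrt(var)
    return [(g - mean) / (std + eps) for g in returns]


def kl_divergence(p, q):
    """KL(p || q) for two discrete action distributions (q must be > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ppo_kl_objective(new_probs, old_probs, action, advantage, beta=0.5):
    """Per-sample PPO objective in its KL-penalty form (assumed here).

    new_probs: action distribution under theta (the policy being updated)
    old_probs: action distribution under theta' (the policy that sampled)
    """
    ratio = new_probs[action] / old_probs[action]  # importance weight
    return ratio * advantage - beta * kl_divergence(old_probs, new_probs)
```

For example, `normalized_returns([1.0, 1.0, 1.0], gamma=1.0)` yields one positive, one zero, and one negative value, and `ppo_kl_objective` reduces to the plain advantage when the two policies are identical.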
- Double DQN
Double DQN addresses the overestimation of Q-values. It uses two networks (online and target) that compensate for each other so that Q-values are not overestimated. The method only needs a second, copied network to obtain the target actions.
- Dueling DQN
Dueling DQN learns a state value shared across all actions and places a normalization constraint on the per-action Q-values. With this method, the Q-value of an action can be updated even if that action was not sampled, which is more efficient.
Double DQN | Dueling DQN |
---|---|
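
The two DQN variants described above change classic DQN in small, local ways. The following is a minimal sketch in plain Python, not the code in this repository: `double_dqn_target` and `dueling_q_values` are hypothetical names, lists of floats stand in for network outputs, and the mean-subtraction form of the dueling constraint is an assumption.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target for one transition.

    The online network chooses the next action and the target network
    evaluates it, instead of taking max over the target network alone.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Subtracting the mean is the normalization constraint: a change in V(s)
    shifts the Q-value of every action, including actions never sampled.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]
```

With `next_q_online = [0.2, 0.9]` and `next_q_target = [5.0, 3.0]`, the Double DQN target uses the target value of action 1 (chosen by the online network) rather than the larger target value of action 0, which is how the overestimation is reduced.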
Beyond the general A2C framework, we also implement proximal policy optimization (PPO) and Generalized Advantage Estimation (GAE) on multi-process environments, with value loss, action loss, and entropy loss. This method takes the KL-divergence constraint into account and can run more training iterations on each batch of steps. Our framework is adapted from here.
Reward normalization clearly gives better results than the baseline. PPO, however, shows little improvement and only reduces the variance of training. Perhaps its benefit is not apparent on LunarLander.
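To make the GAE step concrete, here is a minimal sketch for a single environment in plain Python. It is an illustration only, not the code in this repository: the function name `gae_advantages`, the default `gamma` and `lam` values, and the use of `dones` flags to cut the recursion at episode ends are assumptions.

```python
def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the
    critic's estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # no bootstrapping past episode end
        delta = rewards[t] + gamma * next_value * mask - values[t]  # TD error
        running = delta + gamma * lam * mask * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD error, while `lam=1` sums all future TD errors, so `lam` trades bias against variance. The value targets for the critic can then be taken as `advantages[t] + values[t]`.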
Policy Gradient |
---|
Although the improved techniques give better results than the baseline DQN, combining DoubleDQN and DuelingDQN does not give the highest performance. These results suggest that DuelingDQN is particularly effective at improving the DQN framework on Assault.
Deep Q-Learning (DQN) |
---|
For A2C, performance with PPO is higher than without it, unlike the policy gradient results. Without PPO, performance decays after more training steps, which shows how difficult it is to learn a good agent on Mario.
Advantage-Actor-Critic (A2C) |
---|
| | Policy Gradient | Deep Q-Learning | Advantage Actor-Critic |
|---|---|---|---|
| Games | LunarLander | Assault | Mario |
| Test Episodes | 30 | 100 | 10 |
| Average Rewards | 90.11 | 275.95 | 3243.30 |