/dqn_navigation_project

Collecting bananas using Deep Q-Learning in a Unity environment built with ML-Agents.


Project 1: Navigation

Introduction / Project Details

For this project, you will train an agent to navigate (and collect bananas!) in a large, square world.

[Animation: the trained agent collecting bananas]

A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. Thus, the goal of your agent is to collect as many yellow bananas as possible while avoiding blue bananas.

The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around agent's forward direction. Given this information, the agent has to learn how to best select actions. Four discrete actions are available, corresponding to:

  • 0 - move forward.
  • 1 - move backward.
  • 2 - turn left.
  • 3 - turn right.

The task is episodic, and in order to solve the environment, your agent must get an average score of +13 over 100 consecutive episodes.
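
For reference, the sketch below shows how this state/action/reward interface looks through the Python API, assuming the unityagents package and default brain used by the DRLND course materials; the file_name value is a placeholder that depends on the build you download in the next section.

from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="Banana.app")     # placeholder path; use the build for your OS
brain_name = env.brain_names[0]                    # the default brain controls the agent

env_info = env.reset(train_mode=True)[brain_name]  # reset and read the first state
state = env_info.vector_observations[0]            # 37-dimensional state vector
score = 0

while True:
    action = np.random.randint(4)                  # random action: 0 forward, 1 backward, 2 left, 3 right
    env_info = env.step(action)[brain_name]        # send the action to the environment
    reward = env_info.rewards[0]                   # +1 yellow banana, -1 blue banana
    done = env_info.local_done[0]                  # True when the episode ends
    state = env_info.vector_observations[0]
    score += reward
    if done:
        break

print("Score:", score)
env.close()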

Getting Started

  1. Download the environment from one of the links below. You need only select the environment that matches your operating system:

Instructions for setting up the environment on AWS

  1. First, submit a ticket with Amazon for a rate-limit increase on all P and G instances in the region you will be using. Udacity provided a link for building and configuring your own compute instance; however, the repository README says their documentation is out of date and they no longer use it. Provided are two links: first, the repository/documentation in question, and second, the Amazon Machine Image (AMI) that I recommend requesting the rate increase for: repo AMI aws link

  2. Once your request has been approved, go to the AWS link above to create the EC2 compute instance you will be using. Note: before doing this you must create an SSH key pair, which can be done in the EC2 console under Network & Security -> Key Pairs.

  3. After you have created your key and launched the EC2 instance, SSH into it. The following links provide detailed instructions for doing this with PuTTY as well as with macOS/Linux and OpenSSH: Connecting to Your Linux Instance from Windows Using PuTTY, Connect from Mac or Linux Using an SSH Client

  4. Now that you have access to the environment, we want to connect to a Jupyter notebook. Since we have SSH'ed into the machine, we have to perform an extra step to access it via the browser on our local machine.

  • set up a password to access Jupyter
(ec2-instance)$ jupyter notebook password
Enter password: 
Verify password: 
[NotebookPasswordApp] Wrote hashed password to /home/ubuntu/.jupyter/jupyter_notebook_config.json
  • next, start the Jupyter notebook. NOTE: we use nohup at the beginning and & at the end so the server is not killed if you log out.
(ec2-instance)$ nohup jupyter notebook --no-browser --port=8888 &
nohup: ignoring input and appending output to 'nohup.out'
  • finally, from your local machine, create an SSH tunnel to the EC2 instance so the notebook is reachable in your browser at localhost:8888.
(my-machine)$ ssh -i my-private-key.pem -N -f -L localhost:8888:localhost:8888 user-name@remote-hostname

Report

Learning Algorithm

For this project I used the standard DQN. For the network architecture, a separate target network was used to prevent the model from chasing "moving target" Q values, and an experience replay buffer, sampled uniformly at random, let the agent keep learning from previously seen state-action transitions. The neural networks themselves are rudimentary at best: three fully connected linear layers using rectified linear unit (ReLU) activation functions; a sketch of this network is shown below.
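
A minimal sketch of that network in PyTorch; the hidden layer width of 64 units is illustrative, not necessarily the size used in training.

import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Three fully connected layers with ReLU activations: state in, one Q value per action out."""
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)   # hidden width of 64 is illustrative
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                         # no activation on the output: raw Q values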

For hyperparameters I used the following:

Name                         Value
starting epsilon             1.0
minimum epsilon              0.01
epsilon decay                0.995
max number of episodes       2000 (actual: 1416)
max timesteps per episode    1000
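
The epsilon values above define a per-episode decay schedule for epsilon-greedy action selection. A small sketch of how those numbers fit together (the per-episode training body is elided):

import numpy as np

def epsilon_greedy(q_values, eps):
    # with probability eps explore with a random action, otherwise exploit the greedy one
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

eps_start, eps_min, eps_decay = 1.0, 0.01, 0.995
eps = eps_start
for i_episode in range(1, 2001):               # at most 2000 episodes
    # ... run one episode of at most 1000 timesteps, choosing each action
    #     with epsilon_greedy(current_q_values, eps) ...
    eps = max(eps_min, eps_decay * eps)        # decay once per episode, floored at 0.01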

I regret using this algorithm because it burned up all of my workspace GPU time and cost me about a day of compute time on AWS. I knew the oscillations would be atrocious; however, I did not account for the compute time required to train a DQN. I read that the double DQN and the dueling DQN oscillate far less, but that they do not maximize the Q values as effectively, which is why I stuck with the vanilla DQN.

Plot of Rewards

This graph does a great job of illustrating the oscillations I mentioned above and why training the vanilla DQN was so computationally expensive.
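
Because the raw per-episode scores oscillate so much, a 100-episode rolling average is the easier curve to read against the +13 solve threshold. A plotting sketch, assuming scores holds the list of per-episode returns from training (the random values below are placeholders only, so the snippet runs on its own):

import numpy as np
import matplotlib.pyplot as plt

# placeholder data only; replace with the real per-episode scores from training
scores = list(np.random.normal(loc=8.0, scale=4.0, size=1416))

window = 100
rolling = np.convolve(scores, np.ones(window) / window, mode="valid")  # 100-episode average

plt.plot(scores, alpha=0.3, label="episode score")
plt.plot(np.arange(window - 1, len(scores)), rolling, label="100-episode average")
plt.axhline(13.0, linestyle="--", color="gray", label="solve threshold (+13)")
plt.xlabel("episode")
plt.ylabel("score")
plt.legend()
plt.show()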

Ideas for future work

I have many ideas for future work. First, I am going to try the double DQN because I am interested in the trade-off between reducing the oscillations and maximizing the overall Q values. Next, I will try prioritized experience replay because, from my reading, it better maximizes the Q values. Finally, I will combine the two to see whether they achieve greater performance together than separately.
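
For reference while implementing that, here is a sketch of how the double DQN target differs from the vanilla DQN target; qnet_local and qnet_target are hypothetical names for the online and target networks, each mapping a batch of states to per-action Q values.

import torch

@torch.no_grad()
def dqn_target(qnet_target, rewards, next_states, dones, gamma=0.99):
    # vanilla DQN: y = r + gamma * max_a Q_target(s', a)
    q_next = qnet_target(next_states).max(dim=1, keepdim=True)[0]
    return rewards + gamma * q_next * (1 - dones)

@torch.no_grad()
def double_dqn_target(qnet_local, qnet_target, rewards, next_states, dones, gamma=0.99):
    # double DQN: the local network chooses the action, the target network evaluates it,
    # which reduces the overestimation that drives some of the oscillation
    best_actions = qnet_local(next_states).argmax(dim=1, keepdim=True)
    q_next = qnet_target(next_states).gather(1, best_actions)
    return rewards + gamma * q_next * (1 - dones)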

(Optional) Challenge: Learning from Pixels

After you have successfully completed the project, if you're looking for an additional challenge, you have come to the right place! In the project, your agent learned from information such as its velocity, along with ray-based perception of objects around its forward direction. A more challenging task would be to learn directly from pixels!

To solve this harder task, you'll need to download a new Unity environment. This environment is almost identical to the project environment, where the only difference is that the state is an 84 x 84 RGB image, corresponding to the agent's first-person view. (Note: Udacity students should not submit a project with this new environment.)

You need only select the environment that matches your operating system:

Then, place the file in the p1_navigation/ folder in the DRLND GitHub repository, and unzip (or decompress) the file. Next, open Navigation_Pixels.ipynb and follow the instructions to learn how to use the Python API to control the agent.
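
Learning from pixels mainly means replacing the fully connected Q-network with a convolutional one. A possible (untested) sketch for the 84 x 84 RGB first-person view, with layer sizes borrowed from the classic Atari DQN rather than anything prescribed by this project:

import torch.nn as nn
import torch.nn.functional as F

class PixelQNetwork(nn.Module):
    """Convolutional Q-network for an 84x84 RGB observation (channels-first, values scaled to [0, 1])."""
    def __init__(self, action_size=4):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=8, stride=4)   # 84x84 -> 20x20
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)  # 20x20 -> 9x9
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)  # 9x9 -> 7x7
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, action_size)

    def forward(self, x):                    # x: (batch, 3, 84, 84)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.fc2(x)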

(For AWS) If you'd like to train the agent on AWS, you must follow the instructions to set up X Server, and then download the environment for the Linux operating system above.