
TimberMan Reinforcement Learning

In this work I apply the Deep Q-Network (DQN) algorithm to develop a reinforcement learning agent that learns solely from the screen image. The agent must be able to reach 1000 points in the chosen environment: TimberMan, a game available on the Steam platform.


Environment

TimberMan is a simple arcade game with a very simple purpose: dodge the obstacles that appear over time using two types of discrete actions, left and right. This game serves as the environment for our reinforcement learning algorithm. To use the algorithm, open the game and enter single-player mode. Let your timberman hit an obstacle, and on the restart screen run the following command:

sudo python3 dqn.py --resolution 1920x1080 --train policy_net.pth

Then go back to the game screen and let the algorithm work. In case you are wondering, you need to run with sudo because the keyboard module requires root privileges on Linux.

Note: if you have problems with terminal environment variables, add -E after sudo to preserve your current environment.
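For context, the agent drives the game through simulated key presses from the keyboard module. A minimal sketch of how that might look is shown below; the action-to-key mapping and function name are illustrative assumptions, not the repository's exact code:

import keyboard  # needs root on Linux, hence sudo

# Hypothetical mapping from the network's action index to a key press.
ACTION_KEYS = {0: "left", 1: "right"}

def send_action(action_index):
    # Any unmapped index (e.g. 2) is treated as "do nothing".
    key = ACTION_KEYS.get(action_index)
    if key is not None:
        keyboard.press_and_release(key)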

Setup

All of the requirements are shown in the badges above, but if you want to install them all at once, enter the repository directory and run:

pip3 install -r requirements.txt

Deep Q-Network

Neural networks can usually solve tasks like this just by looking at the screen, so let's use a patch of the screen centered on the playing area as input. By using only the image, our task becomes much more difficult, and since we cannot render multiple environments at the same time, we need a lot of training time. Strictly speaking, we present the state as the difference between the current screen patch and the previous one. This allows the agent to take the velocity of the obstacles into account from a single image.
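As a rough sketch, the state fed to the network could be built as follows (the function name is illustrative, not the repository's exact API):

import numpy as np

def build_state(current_patch, previous_patch):
    # Both patches are preprocessed grayscale frames (see Image Processing).
    # Their difference encodes how the obstacles moved between two captures,
    # so a single input already carries velocity information.
    return current_patch.astype(np.float32) - previous_patch.astype(np.float32)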

Our model will be a convolutional neural network that takes in the difference between the current and previous screen patches. It has three outputs, one per action, each representing the predicted Q-value $Q(s, a)$, where $s$ is the input to the network and $a$ is the corresponding action. In effect, the network is trying to predict the expected return of taking each action given the current input.
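A minimal PyTorch sketch of such a network is shown below. The layer sizes are assumptions chosen for a 160x90 input, not the exact architecture used in dqn.py:

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.head = nn.LazyLinear(n_actions)  # one Q-value per action

    def forward(self, x):
        # x: batch of screen-patch differences, shape (N, 1, 90, 160)
        x = self.conv(x)
        return self.head(x.flatten(start_dim=1))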

Image Processing

The image processing performed in this work is quite simple, but it is very important for the overall functioning of the algorithm. Using the mss module, the screen is captured and stored as a numpy array. With the BGR screen saved, we apply a color conversion available in the OpenCV module to transform everything to grayscale. We crop 53.84% of the upper pixels, 20% of the lower pixels, and 20% of the left and right pixels. After that, we apply the triangle threshold function to transform the image to black and white. Finally, we resize the final image to 160x90 pixels using area interpolation and invert all of the binary pixels. You can follow the steps of this process in the following images:

Image processing comparison

In order, the images are: the original input image, the image converted to grayscale, the cropped image after the triangle threshold, and finally the cropped, thresholded image with all binary pixels inverted.
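Put together, the preprocessing described above might look roughly like the sketch below. This is not the repository's exact code; the crop percentages follow the description above, and the monitor region passed in is an assumption:

import cv2
import numpy as np
from mss import mss

def capture_and_preprocess(monitor):
    # Grab the screen as a BGRA numpy array.
    with mss() as sct:
        frame = np.array(sct.grab(monitor))

    gray = cv2.cvtColor(frame, cv2.COLOR_BGRA2GRAY)

    # Crop 53.84% from the top, 20% from the bottom, 20% from each side.
    h, w = gray.shape
    gray = gray[int(0.5384 * h):int(0.80 * h), int(0.20 * w):int(0.80 * w)]

    # Triangle threshold to black and white, then resize and invert.
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_TRIANGLE)
    bw = cv2.resize(bw, (160, 90), interpolation=cv2.INTER_AREA)
    return cv2.bitwise_not(bw)

# Example for a 1920x1080 screen:
# patch = capture_and_preprocess({"top": 0, "left": 0, "width": 1920, "height": 1080})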

Reward Function

As the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and also returns a reward that indicates the consequences of the action. In this task, the reward is +1 for every incremental timestep, and the episode terminates when the timberman hits an obstacle. This means that better-performing scenarios run for a longer duration, accumulating a larger return.

Our aim will be to train a policy that tries to maximize the discounted, cumulative reward $R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where $R_{t_0}$ is also known as the return. The discount, $\gamma$, should be a constant between 0 and 1 that ensures the sum converges. It makes rewards from the uncertain far future less important for our agent than the ones in the near future that it can be fairly confident about.
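As a small illustration of how the return of a finished episode would be computed (a generic helper, not code from the repository):

def discounted_return(rewards, gamma=0.99):
    # rewards: list of per-step rewards, e.g. [1, 1, 1, ...] in this task.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# An episode that survived 3 timesteps with gamma = 0.99:
# 1 + 0.99 * 1 + 0.99**2 * 1 = 2.9701
print(discounted_return([1, 1, 1]))  # 2.9701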

Results

After training finishes, a file called data.csv is generated. From this file we can plot some important information, such as the reward history, the number of steps per epoch, and the noise present in the data. To view your graphs, run the following command:

python3 data_visualization.py --file data.csv
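If you want to build your own plots from the CSV, a minimal sketch could look like the following. The column name "reward" is an assumption; check the header of your data.csv:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("data.csv")
data["reward"].plot(title="Reward history per epoch")  # assumed column name
plt.xlabel("epoch")
plt.ylabel("reward")
plt.show()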

The following images show the graphs generated from the data.csv file already present in the repository.

Reward graphs comparison

If you liked this repository, please don't forget to star it!