Clip of a pretrained model playing snake (trained for 2.5M timesteps with 12 parallel environments)
2022-06-03.10-11-07.mp4
This project contains my own implementation of snake as a stable_baselines3 VecEnv, plus a training script that trains a PPO-based agent. For more information on stable_baselines3's PPO, see https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
The observation space is a 12-dimensional vector made up of three groups of four flags:
- relative apple direction
  - e.g. [1, 0, 0, 0] -> apple is above
- snake head direction
  - e.g. [0, 1, 0, 0] -> snake is headed right
- is there an obstacle next to the head?
  - e.g. [0, 1, 0, 1] -> there is an obstacle to the left and right of the snake's head
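To make the encoding concrete, here is a minimal sketch of how such a vector could be assembled. It is not the project's actual code: the function name `build_observation`, the flag order (up, right, down, left, inferred from the examples above), the coordinate convention (y grows downwards), and the choice to count walls as obstacles are all assumptions.

```python
# Assumed flag order, inferred from the examples: (up, right, down, left).
# Assumed coordinates: (x, y) grid cells with y growing downwards.
DIRECTIONS = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}


def build_observation(head, apple, heading, obstacles, grid_size):
    """Return the 12 flags as a flat list of 0/1 integers (hypothetical helper).

    head, apple: (x, y) cells.
    heading: one of "up", "right", "down", "left".
    obstacles: set of (x, y) cells occupied by the snake's body.
    grid_size: (width, height); cells outside the grid count as obstacles.
    """
    hx, hy = head
    ax, ay = apple
    width, height = grid_size

    # Group 1: where the apple lies relative to the head.
    apple_dir = [ay < hy, ax > hx, ay > hy, ax < hx]

    # Group 2: one-hot encoding of the current heading.
    head_dir = [heading == name for name in DIRECTIONS]

    # Group 3: is the cell next to the head blocked, per direction?
    blocked = []
    for dx, dy in DIRECTIONS.values():
        nx, ny = hx + dx, hy + dy
        outside = not (0 <= nx < width and 0 <= ny < height)
        blocked.append(outside or (nx, ny) in obstacles)

    return [int(flag) for flag in apple_dir + head_dir + blocked]


obs = build_observation(
    head=(5, 5),
    apple=(5, 2),
    heading="right",
    obstacles={(4, 5), (6, 5)},
    grid_size=(10, 10),
)
print(obs)
```

With these inputs the three groups reproduce the three example vectors listed above. A real VecEnv would typically return this as a NumPy array rather than a list.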
The reward function:
- eating an apple: +100
- dying: -100
- for every step: ((1 / distance(apple, head)) - 0.5) * 10, which is positive only while the distance is below 2 and negative beyond that
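The three terms above can be sketched as a single function. This is an illustration under stated assumptions, not the project's implementation: the name `step_reward` is hypothetical, the distance is assumed to be Euclidean (the metric is not stated here), and the terminal rewards are assumed to replace the per-step term rather than add to it.

```python
import math

APPLE_REWARD = 100.0
DEATH_REWARD = -100.0


def step_reward(head, apple, ate_apple, died):
    """Reward for one environment step (hypothetical helper).

    head, apple: (x, y) cells after the move.
    ate_apple, died: booleans reported by the game logic.
    """
    if died:
        return DEATH_REWARD
    if ate_apple:
        return APPLE_REWARD
    # Assumption: Euclidean distance. It is non-zero here because a head
    # sitting on the apple would have been handled by the branch above.
    distance = math.dist(head, apple)
    return (1.0 / distance - 0.5) * 10.0


print(step_reward((0, 0), (1, 0), ate_apple=False, died=False))  # distance 1
print(step_reward((0, 0), (2, 0), ate_apple=False, died=False))  # distance 2
print(step_reward((0, 0), (4, 0), ate_apple=False, died=False))  # distance 4
```

The per-step term peaks at +5 when the head is directly next to the apple, crosses zero at distance 2, and approaches -5 as the distance grows, so on most of the board the agent is mildly penalized every step.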
TODO:
- play around with the observation space
  - context: the observation space was originally the entire frame, and the snake didn't seem to improve much with it, so there may be even better observation spaces than the current one
- play around with the reward function
  - e.g. change the distance reward to "punish for going further away, reward for getting closer" instead of the current distance formula
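As a starting point for the second TODO item, one possible form of a "closer is rewarded, further is punished" term is the change in distance between consecutive steps. This is only a sketch of the idea: the name `distance_delta_reward`, the `scale` parameter, and the use of Euclidean distance are assumptions, and it has not been tried in this project.

```python
import math


def distance_delta_reward(prev_head, new_head, apple, scale=1.0):
    """Positive when the move reduced the distance to the apple, negative
    when it increased it (hypothetical alternative to the 1/distance term).
    """
    prev_distance = math.dist(prev_head, apple)
    new_distance = math.dist(new_head, apple)
    return scale * (prev_distance - new_distance)


# Moving one cell towards an apple three cells away.
print(distance_delta_reward((0, 0), (1, 0), apple=(3, 0)))
# Moving one cell away from it.
print(distance_delta_reward((1, 0), (0, 0), apple=(3, 0)))
```

Unlike the 1/distance term, this signal has the same magnitude everywhere on the board, and a move towards the apple followed by the reverse move sums to zero.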