A feedforward neural network with three hidden layers is used. The network employs the ReLU activation function in its hidden layers. The input layer consists of 11 nodes derived from the state of the snake, and the output layer consists of three action nodes, i.e. the directions the snake can move in.
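As a sketch of this architecture (assuming NumPy and a hypothetical hidden-layer width of 256, which the text does not specify), the forward pass could look like:

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x) elementwise
    return np.maximum(0.0, x)

class SnakeQNetwork:
    """Feedforward network: 11 inputs -> 3 hidden ReLU layers -> 3 outputs."""

    def __init__(self, hidden=256, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [11, hidden, hidden, hidden, 3]
        # Small random weights; biases start at zero
        self.W = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]

    def forward(self, state):
        x = np.asarray(state, dtype=float)
        # ReLU on the three hidden layers, linear output for the Q-values
        for W, b in zip(self.W[:-1], self.b[:-1]):
            x = relu(x @ W + b)
        return x @ self.W[-1] + self.b[-1]

net = SnakeQNetwork()
q_values = net.forward(np.zeros(11))
print(q_values.shape)  # (3,)
```

The linear (identity) output layer is the usual choice for Q-value regression, since Q-values are unbounded.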
The actions are the choices made by the agent. The states are the basis for making the choices. The rewards are the basis for evaluating the choices.
Deep Q-Learning Algorithm:
- Initialise the Q-values
- Choose the action to be performed; the action-selection policy is epsilon-greedy
- Perform the action (A_t) for time step t and measure the reward (R_t) associated with that action
- Update the Q-value for the action A_t
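The steps above can be sketched with the standard tabular Q-learning update rule, a simplification of the deep version in which a table stands in for the network (the learning rate, discount factor, and the single illustrative transition here are assumptions):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    # With probability epsilon explore a random action, otherwise exploit
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def q_update(Q, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Q = defaultdict(float)  # unseen state-action pairs default to 0
# One illustrative step: in state 0, act, receive reward 1, land in state 1
a = epsilon_greedy(Q, 0, n_actions=3, epsilon=0.1)
q_update(Q, 0, a, reward=1.0, next_state=1, n_actions=3)
print(Q[(0, a)])  # 0.1
```

In the deep variant, the table lookup is replaced by the network's predicted Q-values and the update becomes a gradient step on the squared error against the same target.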
Each rule that dictates how actions are taken as a function of the state is called a policy. Each policy has a value function that associates every state-action pair with the expected return obtained if that state-action pair is performed. An optimal value function assigns the largest expected return to each state, or state-action pair. We will use the Bellman optimality equation here to derive these optimal value functions.
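For reference, the Bellman optimality equation for the action-value function, in standard notation (the text does not spell it out), is:

```latex
q_*(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
```

It says the optimal value of taking action a in state s equals the expected immediate reward plus the discounted value of acting optimally from the next state; the Q-update in the algorithm above nudges the estimate toward this target.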
The 11 state variables we will use are [direction left, direction right, direction up, direction down], [food up, food down, food right, food left], and [danger straight, danger right, danger left]. The moves are chosen by an epsilon-greedy algorithm.
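A minimal sketch of how such an 11-element state vector might be assembled (the function name, parameters, and relative-coordinate convention are illustrative assumptions, not from the text):

```python
def build_state(direction, food_dx, food_dy, danger):
    """Encode the snake's situation as 11 booleans (as ints).

    direction: one of "left", "right", "up", "down"
    food_dx, food_dy: food position relative to the head (negative dy = above)
    danger: dict with keys "straight", "right", "left" -> bool
    """
    return [
        int(direction == "left"),
        int(direction == "right"),
        int(direction == "up"),
        int(direction == "down"),
        int(food_dy < 0),   # food up
        int(food_dy > 0),   # food down
        int(food_dx > 0),   # food right
        int(food_dx < 0),   # food left
        int(danger["straight"]),
        int(danger["right"]),
        int(danger["left"]),
    ]

state = build_state("right", food_dx=2, food_dy=-1,
                    danger={"straight": False, "right": True, "left": False})
print(len(state))  # 11
```

This vector is what feeds the 11-node input layer described earlier.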