This project is my attempt to create an AI that will be able to beat humans at the game of checkers.
The error to be minimized is the temporal difference error (TDE). Episodes are generated as follows: state, action, reward, next state, next action, ... until a terminal state is reached. The function q(S, A) is the value of taking action A in state S. The value of the previous state-action pair is updated to be closer to the next reward plus a discount factor gamma multiplied by the maximum action value of the next state.
TDE equation:
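Going by the description above, the TDE presumably takes the standard Q-learning form. The symbols δ (the TDE), R (the next reward), S' (the next state) and a' (an action available in S') are notation assumed here for illustration:

$$
\delta = R + \gamma \max_{a'} q(S', a') - q(S, A)
$$

When S' is terminal there is no next action, so the maximum term is usually taken to be zero and the target reduces to R.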
The Q function is optimized using backprop:
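As a rough illustration of one such gradient step, here is a minimal sketch that assumes a linear Q function with one weight vector per action, standing in for whatever network the project actually uses. The names `q_values`, `td_update`, `alpha` and the two-feature toy state are hypothetical and are not taken from this repository.

```python
def q_values(weights, features):
    """Return q(S, a) for every action a, using one weight vector per action."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


def td_update(weights, features, action, reward, next_features, done,
              gamma=0.99, alpha=0.1):
    """Apply one semi-gradient step on 0.5 * TDE**2 and return the TDE."""
    q_sa = q_values(weights, features)[action]
    # Target: next reward plus the discounted maximum action value of the
    # next state. A terminal next state contributes no future value.
    if done:
        target = reward
    else:
        target = reward + gamma * max(q_values(weights, next_features))
    tde = target - q_sa
    # For a linear model, dq(S, A)/dw[A] equals the feature vector, so the
    # backprop step reduces to the line below. The target is treated as a
    # constant, so no gradient flows through it.
    for i, x in enumerate(features):
        weights[action][i] += alpha * tde * x
    return tde


# Toy setup (assumed): two actions, two state features, all weights zero.
weights = [[0.0, 0.0], [0.0, 0.0]]
state, next_state = [1.0, 0.0], [0.0, 1.0]
first = td_update(weights, state, action=1, reward=1.0,
                  next_features=next_state, done=True)
second = td_update(weights, state, action=1, reward=1.0,
                   next_features=next_state, done=True)
```

Repeating the same transition shrinks the error for that state-action pair: `first` is 1.0 and `second` is 0.9, because the first step moved q(S, A) from 0 to 0.1. With a neural network the explicit weight loop would be replaced by the framework's automatic differentiation, but the target and the error are computed the same way.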
Tristan Shah
Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto
This project is licensed under the MIT License - see the LICENSE.md file for details