This is my first RL agent using a Monte Carlo method. First we generate an episode. Then loop back through each step and assign each state value pair the current value of the expected reward. Once an episode is completed the current policy is updated to be greedy with respect to the current action values. Here is the optimal policy after 20000 games:
Simply run the script to train the agent and generate a graph of the results.
python3 monte.py
numpy, matplotlib
pip3 install numpy
pip3 install matplotlib
Tristan Shah
This project is licensed under the MIT License - see the LICENSE.md file for details