
Review for Lab10 S314872


The provided code leverages Reinforcement Learning in order to devise a Tic-Tac-Toe player agent.

It adopts the Q-learning paradigm, trying to learn an action-value function (mapping each possible game state to the corresponding optimal action to take) by playing a number of matches and receiving a reward for each move made.
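For reference, a minimal sketch of the tabular Q-learning scheme assumed here; the state encoding, hyperparameters, and helper names are illustrative and not taken from the reviewed code.

```python
from collections import defaultdict
import random

ALPHA = 0.1    # learning rate
GAMMA = 0.9    # discount factor
EPSILON = 0.1  # exploration rate

Q = defaultdict(float)  # maps (state, action) -> estimated value

def choose_action(state, available_actions):
    """Epsilon-greedy policy over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(available_actions)
    return max(available_actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """Standard one-step Q-learning update."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```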

The interesting idea is to give a reward even for non-terminal states, for example if the agent occupies two cells in a row/column/diagonal and the corresponding third one is still available.
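A hypothetical reward-shaping function in the spirit of this idea: reward the agent when it holds two cells of a line whose third cell is still free. The board representation (a tuple of 9 cells holding 'X', 'O', or None) and the reward value are assumptions for illustration, not the reviewed implementation.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def shaped_reward(board, player):
    """Small positive reward for each open two-in-a-row held by `player`."""
    reward = 0.0
    for a, b, c in LINES:
        cells = (board[a], board[b], board[c])
        if cells.count(player) == 2 and cells.count(None) == 1:
            reward += 0.5  # non-terminal bonus; value chosen arbitrarily here
    return reward
```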

The code is clear and appropriately commented, and in a few seconds it trains an agent capable of winning or drawing roughly half of its matches against a random player.

To improve the agent's win rate I would suggest modifying the reward function. At present it takes as parameters the current state of the game and the action the agent wants to perform, but, in addition to assigning a positive reward if that action turns out to be a winning move, it also assigns a negative one if that move would have been taken by the opponent, which invalidates the Q-value computation. My advice is to let the agent play an entire match and then backpropagate the rewards obtained over the whole match, as in the sketch below.
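A sketch of the suggested change: play a full match, record the agent's (state, action) pairs, and only afterwards propagate the final outcome backwards through the trajectory (a Monte Carlo-style update). Function names such as `play_episode` and `final_reward` are hypothetical placeholders for the corresponding pieces of the existing code.

```python
def train_one_match(Q, play_episode, final_reward, alpha=0.1, gamma=0.9):
    trajectory = play_episode()          # list of (state, action) for the agent
    outcome = final_reward(trajectory)   # e.g. +1 win, 0 draw, -1 loss
    target = outcome
    # Walk the match backwards so earlier moves receive a discounted share
    # of the final result instead of a per-move guess.
    for state, action in reversed(trajectory):
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        target *= gamma
    return outcome
```

This way each move's value is grounded in the actual result of the match it belonged to, rather than in a per-move estimate that mixes the agent's and the opponent's perspectives.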