Observation and training implementations don't appear to be correct.
The observation implementation seems broken. An action is sampled at random with probability 'epsilon', which makes sense, but the model is queried with probability (1 - epsilon); during observation, however, the model is completely untrained, is never updated, and is producing outputs that bear no relation to the problem at all.
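To make sure we're describing the same thing, here is roughly how I read the observation loop. This is only a sketch: the function and variable names are mine, not the notebook's, and I'm assuming a Keras-style `predict` call.

```python
import random
import numpy as np

# Hypothetical sketch of the observation-phase action selection as I understand it.
# Names are placeholders; predict() assumes a Keras-style model.
def choose_action(model, state, epsilon, n_actions):
    # With probability epsilon, explore with a uniformly random action.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Otherwise "exploit" the model -- but during observation the model is
    # untrained and never updated, so this argmax is effectively arbitrary.
    q_values = model.predict(state[np.newaxis])
    return int(np.argmax(q_values[0]))
```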
Not only that, but in the phase where the model is supposed to be trained, predictions from the untrained model are used as the targets, essentially teaching the model to reproduce random outputs derived from its own uniform weight initialization. As far as I can tell, this carries over to the final 'play' phase: running the stock implementation, the agent never completes the game and behaves fairly randomly.
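For comparison, this is the kind of target computation I would expect in the training phase, where the observed reward (plus a discounted bootstrap from the next state) anchors the target rather than the model's raw predictions alone. Again, just a sketch with hypothetical names, assuming a Keras-style API and batches drawn from replay memory.

```python
import numpy as np

# Hypothetical sketch of a standard DQN target update, not the notebook's code.
def compute_targets(model, states, actions, rewards, next_states, dones, gamma=0.99):
    q_current = model.predict(states)        # Q(s, .) for the sampled states
    q_next = model.predict(next_states)      # bootstrap values from the next states
    targets = q_current.copy()
    for i in range(len(states)):
        if dones[i]:
            targets[i, actions[i]] = rewards[i]
        else:
            # Observed reward plus discounted best next-state value -- the reward
            # signal is what anchors learning, not the model's own outputs alone.
            targets[i, actions[i]] = rewards[i] + gamma * np.max(q_next[i])
    return targets
```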
Perhaps I'm confused, but could you elaborate on what you think is happening in each phase of the notebook?