- Build features: autoencoders on 10-100 states, then use them as features for a simple model.
- Train: perhaps with Q-learning?
- Apparently Q-learning can be implemented with continuous states as well, instead of binary ones. Let's try this.
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf - cross-entropy optimization method
- https://hal.inria.fr/inria-00418930/document - same cross-entropy with more features
- https://hal.inria.fr/inria-00187997v1/document - automatic feature extraction using genetic programming
- http://journals.plos.org/ploscompbiol/article/asset?id=10.1371%2Fjournal.pcbi.1000894.PDF - Unsupervised learning based on the slowness principle (i.e. learning that exploits temporal continuity of real-world stimuli)
- Think about how to implement online learning: learn while you go through the level (i.e. modify weights by responses in the first half, then cash in on the changes in the second half).
- Idea from a random redditor: "I mean it might be able to work. But the fact that it's broken up into 'levels' and you're only given one level to train on makes it seem like the train and test 'levels' are not iid. Plus they give you a simulation but they dont really let you use it (no monte carlo planning). It also excludes actor-critic since then you're basically just learning the training simulation. So what does that leave? Q-learning and TD both require some sort of model, but the levels are IID. You could probably use a variant that's online but that's probably it."
https://en.wikipedia.org/wiki/Temporal_difference_learning
https://en.wikipedia.org/wiki/Q-learning
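
To make the "Q-learning with continuous states, learned online" idea concrete, here is a minimal sketch using a linear approximation, Q(s, a) = weights[a] · features(s), updated after every step. Everything in it is an assumption for illustration, not something taken from the competition: the function names, the feature vector (it could be the autoencoder output), the number of actions, and the `alpha`, `gamma` and `epsilon` values.

```python
import random


def q_values(weights, features):
    """Linear Q estimate for every action: Q(s, a) = weights[a] . features."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


def choose_action(weights, features, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))
    q = q_values(weights, features)
    return max(range(len(q)), key=lambda a: q[a])


def q_update(weights, features, action, reward, next_features, done,
             alpha=0.01, gamma=0.95):
    """One online Q-learning step. Modifies weights in place, returns the TD error."""
    q_sa = q_values(weights, features)[action]
    if done:
        target = reward
    else:
        target = reward + gamma * max(q_values(weights, next_features))
    td_error = target - q_sa
    # Gradient of a linear Q w.r.t. weights[action] is just the feature vector.
    for i, x in enumerate(features):
        weights[action][i] += alpha * td_error * x
    return td_error


# Hypothetical example: 2 actions, 3 continuous features, one terminal step.
rng = random.Random(0)
weights = [[0.0] * 3 for _ in range(2)]
state = [0.5, -1.0, 2.0]
action = choose_action(weights, state, epsilon=0.1, rng=rng)
q_update(weights, state, action, reward=1.0,
         next_features=[0.0, 0.0, 0.0], done=True, alpha=0.1)
```

Because the weights change after each step, the same loop could in principle keep learning while playing a level, which is the "learn in the first half, cash in during the second half" idea.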
- Sequences occur: a good choice for the next action is influenced by the previous choice. However, it is interesting to see that it is not just long runs of the same action - that would suggest a strong diagonal in this w matrix, which is not present:
[[-0.0335094 , -1.15729392, 1.08289921, 0.93143314],
[-0.52939415, -0.40155876, 0.30389929, -0.58338237],
[ 0.18959498, -0.56440568, 0.33902133, 2.06911397],
[-0.05953684, -0.92938888, 0.26350981, 1.17015839]]
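
A quick way to check the "no strong diagonal" observation is to compare the diagonal entries with the rest and to look at which next action each row favours. This is only a sketch: it assumes rows index the previous action and columns the next action, which the notes do not state.

```python
# Weight matrix copied from the notes.
# Assumption: rows = previous action, columns = next action.
w = [
    [-0.0335094, -1.15729392, 1.08289921, 0.93143314],
    [-0.52939415, -0.40155876, 0.30389929, -0.58338237],
    [0.18959498, -0.56440568, 0.33902133, 2.06911397],
    [-0.05953684, -0.92938888, 0.26350981, 1.17015839],
]

n = len(w)
diagonal = [w[i][i] for i in range(n)]
off_diagonal = [w[i][j] for i in range(n) for j in range(n) if i != j]

mean_diag = sum(diagonal) / len(diagonal)
mean_off = sum(off_diagonal) / len(off_diagonal)

# For each previous action (row), the next action (column) with the largest weight.
preferred_next = [max(range(n), key=lambda j: row[j]) for row in w]
# Rows whose largest weight sits on the diagonal, i.e. "repeat the same action".
repeat_rows = [i for i in range(n) if preferred_next[i] == i]

print("mean diagonal:", mean_diag)
print("mean off-diagonal:", mean_off)
print("preferred next action per row:", preferred_next)  # [2, 2, 3, 3]
print("rows favouring a repeat:", repeat_rows)            # [3]
```

Only the last row has its largest weight on the diagonal, which fits the observation that the sequences are not just repeats of the same action.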
Many states visited 0 times. This is a problem.
Lots of overfitting happens here. We need to fix that, and then bumyeah!