germain-hug/Deep-RL-Keras

Can’t the actor-critic method solve the MountainCar environment?

alanyuwenche opened this issue · 2 comments

I tried your code on the MountainCar environment with A2C. It showed no progress (always -200) no matter how long it was trained. However, this problem can be easily solved by the DQN or DDQN algorithm.

Actually, I used my own program on MountainCar and encountered the same problem, which is why I started studying your code. Can’t the actor-critic method solve the MountainCar environment? If not, do you know the reason?

Most implementations I've seen that solve MC in a reasonable amount of time use reward shaping, i.e., the reward is modified based on, e.g., the position of the vehicle. This should make sense intuitively: with the default reward, the agent essentially has nothing beneficial to learn until it randomly happens across the finish line. If you run a random agent on the environment, it's likely not to cross the finish line for millions and millions of time-steps. With the default rewards, (D)DQN might perform better than AC due to epsilon-greedy exploration, but it's still going to suck and be slow.
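For concreteness, here's a minimal sketch of what reward shaping could look like as a gym wrapper. The position-based bonus and its scale are just illustrative assumptions on my part (not the shaping used in this repo or the article below), and it assumes the classic gym API where `step` returns a 4-tuple:

```python
# Minimal reward-shaping sketch for MountainCar-v0 (assumes classic gym API).
# The bonus term below is an illustrative choice, not a recommendation.
import gym


class ShapedMountainCar(gym.Wrapper):
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        position, velocity = obs
        # Default reward is -1 per step. Add a bonus that grows as the car
        # climbs the right hill (position ranges from -1.2 to 0.6; the flag
        # is at 0.5), so the agent gets a learning signal before it ever
        # reaches the goal.
        reward += 10.0 * (position + 1.2) / 1.8
        return obs, reward, done, info


if __name__ == "__main__":
    env = ShapedMountainCar(gym.make("MountainCar-v0"))
    obs = env.reset()
    done = False
    while not done:
        obs, reward, done, _ = env.step(env.action_space.sample())
```

You'd wrap the environment like this and train A2C on it unchanged; the shaping only changes the reward the agent sees, not the environment dynamics.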

Anyway, there are a bunch of articles about MC on Medium that discuss reward shaping. Here's one that shows some charts with and without modifying the reward: https://medium.com/@ts1829/solving-mountain-car-with-q-learning-b77bf71b1de2

Thanks for your help!