Conceptual question about DQN when reward is always -1
keithmgould opened this issue · 0 comments
Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when the goal is achieved), I don't understand how DQN with experience replay converges. Yet I know it does, because I have working code (basically your awesome code, that is) that proves it.
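For concreteness, here is a little sketch of what I mean (assuming the classic gym API where `step()` returns `(obs, reward, done, info)`):

```python
import gym

# Sketch: step through MountainCar-v0 with random actions and print each reward.
# Every step yields -1.0, including the final one, whether the flag is reached
# or the 200-step time limit ends the episode.
env = gym.make("MountainCar-v0")
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    print(reward)  # always -1.0
env.close()
```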
It is my understanding that ultimately there needs to be some "sparse reward" for the agent to find. Yet as far as I can see from the OpenAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.
What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000; if the car finds the flag quickly, the return might be -200. The reason this does not answer my question is that with DQN and experience replay, those returns (-1000, -200) are never present in the replay memory. All the memory contains are tuples of the form (state, action, reward, next_state), and of course tuples are pulled from memory at random, not episode by episode.
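Something like this is how I picture the memory (my own minimal sketch, not your actual code; I've added a done flag, which I believe most implementations also store):

```python
import random
from collections import deque

# Minimal sketch of a replay memory (placeholder names). Each entry is one
# single-step transition; nothing records which episode it came from or what
# that episode's total return was.
memory = deque(maxlen=100_000)

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    # Uniform random sample across all stored transitions. A transition from a
    # -1000 episode looks identical to one from a -200 episode: the stored
    # reward is -1.0 in both.
    return random.sample(memory, min(batch_size, len(memory)))
```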
If reaching the flag yielded a reward of +1 (or +100, etc.), things would make more sense to me...
So, I don't see anything in the memory that indicates that the episode was performed well.
And thus, I have no idea why this DQN code is working for MountainCar.
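For reference, this is the per-transition update I picture the code doing (a generic DQN target sketch with my own names; gamma and the done mask are the standard ingredients, not something I pulled from your code):

```python
import numpy as np

def td_target(reward, next_state_q_values, done, gamma=0.99):
    # Standard DQN target for one sampled transition. The only quantities it
    # ever sees are the per-step reward (always -1.0 here) and a bootstrapped
    # estimate of the next state's value; the episode return (-1000 vs -200)
    # never appears anywhere.
    return reward + gamma * np.max(next_state_q_values) * (1.0 - float(done))
```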
PS: I asked this question on your blog too (as a comment). Apologies for duplication -- I'm not sure where you look and don't look :)