qfettes/DeepRL-Tutorials

DQN not learning on stacked frame inputs

MatthewInkawhich opened this issue · 10 comments

Hello! I am trying to train the DQN model (01.DQN) on the Pong task. I changed the frame_stack arg in the wrap_deepmind function to True; however, the model does not learn anything. I was curious if you had any advice for this. Also, I was wondering why your default script uses frame_stack = False. All of the papers appear to recommend feeding 4x84x84 inputs so the network can infer temporal components of the environment, such as ball velocity.
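For reference, this is roughly the only change I made (a sketch assuming the standard OpenAI Baselines atari_wrappers; the env id is just the one I happen to use):

```python
# Sketch of the only change I made, assuming the standard OpenAI Baselines
# wrappers (baselines.common.atari_wrappers).
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = make_atari("PongNoFrameskip-v4")      # includes MaxAndSkipEnv(skip=4)
env = wrap_deepmind(env, frame_stack=True)  # was frame_stack=False by default
```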

Thanks for the nice readable repo!

The hyperparameters here were tuned to allow the agent to learn Pong quickly, as opposed to all of the Atari games offered by OpenAI Gym. Pong is a relatively simple Atari game, so frame stacking is unnecessary.

If you want to enable framestacking, you'll likely need to tune several of the other hyperparameters. A good place to start would be the hyperparameter values reported in the Nature DQN paper; however, I suspect you could tune those to enable quicker learning if you're only interested in Pong.

@qfettes Got it. Yes, I am looking to train models for multiple Atari games. I will tinker with the hyperparams for a while. Any idea which params would be particularly sensitive to this kind of change? I don't have much experience in RL training.

Thanks again.

All will be important.

Epsilon decay, Target Net Update frequency, experience replay size, and the learning rate are probably the furthest "off" for working on general Atari games.
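For concreteness, the values from the Nature paper (Extended Data Table 1) look roughly like this; treat it as a starting-point sketch rather than settings verified in this repo, and the names are just illustrative:

```python
# Approximate Nature DQN settings (Mnih et al., 2015) as a starting point for
# general Atari games; names are illustrative, not this repo's variable names.
nature_dqn_hyperparams = dict(
    replay_memory_size=1_000_000,   # experience replay capacity
    replay_start_size=50_000,       # random-policy steps before learning starts
    target_net_update_freq=10_000,  # steps between target-network syncs
    learning_rate=0.00025,          # RMSprop step size in the paper
    epsilon_start=1.0,
    epsilon_final=0.1,
    epsilon_decay_frames=1_000_000, # linear epsilon annealing
    batch_size=32,
    gamma=0.99,                     # discount factor
    update_freq=4,                  # one gradient update every 4 agent steps
)
```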

Also note that the original paper performs an update only every 4 timesteps; this agent updates every timestep. Finally, this code uses Huber loss rather than MSE.
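In PyTorch terms the loss difference is roughly the following (a sketch; q_values and targets are dummy stand-ins for the batched Q estimates and TD targets):

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins for the batched Q(s, a) estimates and TD targets.
q_values = torch.randn(32)
targets = torch.randn(32)

# This repo: Huber loss -- quadratic for small TD errors, linear for large
# ones, so outlier transitions produce bounded gradients.
loss = F.smooth_l1_loss(q_values, targets)

# Plain MSE, as usually written for DQN (the Nature paper clips the TD error
# to [-1, 1], which has an effect similar to the Huber loss):
# loss = F.mse_loss(q_values, targets)
```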

@qfettes Thanks for the tips. Sorry for the barrage of questions, but doesn't your code also perform an update every 4 time steps, since we are using the "make_atari" function from OpenAI Baselines to apply the env = MaxAndSkipEnv(env, skip=4) wrapper? This wrapper should handle the skipping (and action repeating) of 4 time steps for us, no? Or do we still have to update only every 4 "observed" time steps on top of this wrapper (equivalent to every 16 "real" time steps with the frame skipping)?

Also, I cannot get the 01.DQN code to train, even without any modifications. The only things I have done are copy and paste the code into a .py file, replace the plot function calls with a print of the average loss and reward over the last 10000 steps, and run it on a CUDA server. I am observing no reward gain at all.
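Concretely, my only edit was along these lines (a sketch; in the real loop, frame_idx is the step counter and step_rewards / losses are lists I append to each step/update):

```python
import numpy as np

# Hypothetical replacement for the notebook's plot() call: print running
# averages over the last 10000 entries instead of drawing a figure.
frame_idx, step_rewards, losses = 10000, [0.0], [0.0]  # placeholders here

if frame_idx % 10000 == 0:
    print(f"frame {frame_idx}: "
          f"avg reward {np.mean(step_rewards[-10000:]):.3f}, "
          f"avg loss {np.mean(losses[-10000:]):.5f}")
```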

Any ideas? Thanks again for your time, I really appreciate it.

Correct. Have a look at Table 1 in "Human-level control through deep reinforcement learning." You'll notice they both skip 4 frames (and repeat the selected action) and perform an update only every 4th action. So one update every 16 frames is what was originally published.
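Spelling out the bookkeeping (a sketch of the Baselines-style MaxAndSkipEnv behaviour, not this repo's exact code):

```python
# One agent step through MaxAndSkipEnv(skip=4) already consumes 4 emulator
# frames: the chosen action is repeated 4 times and the last two frames are
# max-pooled to remove sprite flicker. Updating every 4th agent step, as in
# the Nature paper, then gives:
frames_per_agent_step = 4       # handled by the wrapper (action repeat)
agent_steps_per_update = 4      # handled in the training loop
frames_per_update = frames_per_agent_step * agent_steps_per_update
print(frames_per_update)        # 16 emulator frames per gradient update
```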

I'm unsure what is causing your issue. Running as-is in the IPython notebook should work correctly. Is it possible you changed something by mistake while copying the code?

@MatthewInkawhich I wonder if this might help provide some clarity on frame skipping?

Hi @qfettes, many thanks for this repo. I appreciate the readability of your code, and the implementation of several methods on the same environment (Pong).

> Also, I cannot get the 01.DQN code to train, even without any modifications.

I started by running 01.DQN in its original form, and it does not seem to have made progress during training. As suggested in the Readme, I used the relevant code from OpenAI Baselines for the env wrappers. The notebook 01.DQN.ipynb is running straight off my PC. I know the as-is hyperparameters are sensible for Pong, since similar values without frame stacking led to human-level results when I ran OpenAI Baselines and higgsfield's RL-Adventure on my computer.

Sorry, I haven't yet been able to identify possible reasons for the lack of learning. Perhaps I'll put it into a .py file for debugging. But for now, I thought it might be useful to just put the issue out there.

[Figure: 01.DQN Pong training results, showing no improvement in reward]

Thank you @MatthewInkawhich for bringing this to my attention and @algebraic-mouse for confirming the issue. After some testing, you were both correct in your assessment. I introduced a bug in the training loop in one of the most recent commits; the bug has been fixed in the latest commit, and a few other QoL changes were made. The other notebooks will receive a similar check/update soon!

The rewards aren't increasing.

@BlueDi Double check to make sure you have pulled the most recent version. It has been recently verified to be working correctly.