
Vanilla Policy Gradient with Lunar Lander

Results

Agent before & after training:

(GIFs: the agent playing Lunar Lander before training and after training.)

Training is very noisy:

(Plots: rewards vs. episodes, steps vs. episodes, and max reward vs. episodes during training; average reward vs. iterations during evaluation.)
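The notebook itself isn't reproduced here, but the core of a vanilla policy gradient update is weighting each action's log-probability by the discounted reward-to-go from that timestep. As a minimal sketch (the function name and `gamma` default are illustrative, not taken from the notebook):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go for each timestep: G_t = sum_k gamma^k * r_{t+k}.

    These returns are what multiply the log-probabilities of the actions
    taken, giving the REINFORCE-style gradient estimate.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk the episode backwards, accumulating the discounted sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards [1, 1, 1] with gamma=0.5 -> [1.75, 1.5, 1.0]
print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))
```

Because each return sums every subsequent (discounted) reward along a single sampled trajectory, the estimate is unbiased but high-variance, which is consistent with the noisy training curves above.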

Todo

  • batch multiple episodes per policy update (the policy is currently updated after every single episode)
  • add a baseline and other variance-reduction techniques
  • try different algorithms altogether
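For the baseline item, one common approach is to subtract the mean return across a batch: this leaves the gradient unbiased but shrinks its variance. A hypothetical sketch (function name and the optional normalization are my choices, not the repo's):

```python
import numpy as np

def advantages_with_baseline(returns):
    """Subtract a mean-return baseline and normalize.

    Subtracting any action-independent baseline keeps the policy gradient
    unbiased while reducing its variance; normalizing to unit scale is a
    common extra stabilization trick.
    """
    returns = np.asarray(returns, dtype=np.float64)
    adv = returns - returns.mean()          # baseline: batch mean return
    std = adv.std()
    return adv / std if std > 0 else adv    # avoid dividing by zero

# The resulting advantages replace raw returns in the log-prob weighting.
print(advantages_with_baseline([10.0, 20.0, 30.0]))
```

A learned state-value function V(s) as the baseline (i.e., an actor-critic) would reduce variance further than this constant batch-mean version.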