openai/train-procgen

Plotting the episode reward

dotchen opened this issue · 2 comments

Hello,

Thank you so much for releasing the training code of your paper!

I have a small question regarding the logging of the episode reward: from baselines' PPO implementation, it seems that the rewards are logged over a fixed number of steps and the environments are not reset. For some Procgen tasks, such as maze in easy difficulty mode, this results in a large reward at step 0, which is a little weird from a plotting perspective.
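To illustrate what I mean, here is a toy sketch (not the actual baselines code; the episode lengths and rewards are made-up numbers meant to mimic maze-like behavior, and `epinfobuf` stands in for ppo2's rolling buffer of recent episode infos):

```python
from collections import deque
import numpy as np

# Toy model of the early-logging bias (not the actual baselines code).
# In maze, successful episodes are short (goal reached, reward 10) while
# failures run to the time limit (reward 0), so successes finish and get
# logged first, inflating the mean until the buffer fills up.
epinfobuf = deque(maxlen=100)  # ppo2 keeps roughly the last 100 episode infos

episodes = [{'l': 50, 'r': 10.0}] * 30 + [{'l': 500, 'r': 0.0}] * 70
episodes.sort(key=lambda e: e['l'])  # shorter episodes complete earlier

for n, ep in enumerate(episodes, 1):
    epinfobuf.append(ep)
    if n in (10, 30, 100):
        eprewmean = np.mean([e['r'] for e in epinfobuf])
        print(f"{n:3d} episodes logged: eprewmean = {eprewmean:.2f}")
# prints 10.00 after 10 and 30 episodes, but 3.00 once all 100 are in
```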

I am wondering if the reward plots in the paper were produced the same way? If not, would you be willing to share how you plot them? Many thanks!

yes, we will be merging a branch that includes all the plotting logic very soon.

you are correct that there is a logging bias toward short episodes in baselines while the episodic reward buffers are not yet full. for this reason, we discard the first couple of data points when plotting our results. once the buffers are full, this bias disappears.
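for example, something like this (a rough sketch, not our exact plotting code; the csv path and the column names `misc/total_timesteps` / `eprewmean` are assumptions and differ across baselines versions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# rough sketch: read the csv written by baselines' logger and drop the
# first couple of rows, which are biased while the episode buffer fills.
# the path and column names here are assumptions; check your progress.csv.
df = pd.read_csv("results/progress.csv")
df = df.iloc[2:]  # discard the first couple of data points

plt.plot(df["misc/total_timesteps"], df["eprewmean"])
plt.xlabel("timesteps")
plt.ylabel("mean episode reward (recent episodes)")
plt.show()
```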

Awesome, thanks!