vwxyzjn/cleanrl

[BUG] Different final epsilon and evaluation epsilon for Atari implementations


Problem Description

Within the Q-learning implementations for Atari (DQN, C51, and QDAgger DQN, in both the JAX and PyTorch versions), the final epsilon value used during training (e.g. 0.01) differs from the epsilon value used during the evaluation at the end of training (e.g. 0.05). A paraphrased sketch of the relevant pattern is shown below.
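For illustration, here is a minimal sketch of the mismatch, paraphrased from the structure of `dqn_atari.py` (not a verbatim quote; exact argument names and defaults may differ across the affected scripts):

```python
# During training, epsilon is annealed down to args.end_e (0.01 by default for Atari):
epsilon = linear_schedule(
    args.start_e, args.end_e,
    args.exploration_fraction * args.total_timesteps,
    global_step,
)

# At the end of training, the evaluation helper is called with a hardcoded
# epsilon of 0.05, which does not match the final training epsilon above:
episodic_returns = evaluate(
    model_path,
    make_env,
    args.env_id,
    eval_episodes=10,
    run_name=f"{run_name}-eval",
    Model=QNetwork,
    device=device,
    epsilon=0.05,
)
```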

I believe this results in unfair evaluations on the Atari environments relative to the agents' true performance.

I don't think this affects the training curves, since we mostly compare episodic returns rather than evaluation results, but it should be fixed for users who compare the evaluation results.

This bug appears to have been introduced when code was copied from the DQN agent, where 0.05 is the final training epsilon.


Current Behavior

Agent policies are evaluated at a different, higher epsilon than the final training epsilon.

Expected Behavior

Agent policies should be evaluated with the final epsilon used during training.

Possible Solution

Modify all affected Q-learning agents so that the evaluation epsilon equals the final training epsilon, as sketched below.
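A minimal sketch of the fix, assuming the `evaluate` helper and `args.end_e` names used in the existing scripts (the exact call signature may differ per script):

```python
# Proposed change: pass the final training epsilon to the evaluation helper
# instead of the hardcoded 0.05.
episodic_returns = evaluate(
    model_path,
    make_env,
    args.env_id,
    eval_episodes=10,
    run_name=f"{run_name}-eval",
    Model=QNetwork,
    device=device,
    epsilon=args.end_e,  # previously a hardcoded 0.05
)
```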