[BUG] Different final epsilon and evaluation epsilon for Atari implementations
Opened this issue · 0 comments
Problem Description
Within the Q-learning implementations for Atari (DQN, C51, and QDAgger DQN, in both the JAX and PyTorch versions), the final epsilon used during training (e.g., 0.01) differs from the epsilon used during evaluation at the end (e.g., 0.05).
I believe this results in unfair evaluations of the Atari agents relative to their true performance.
I don't think this affects the training curves, since we mostly compare episodic returns rather than evaluation results, but we should fix it for users who compare the evaluation results.
This bug appears to have been introduced when copying code from the DQN agent, where 0.05 is the final epsilon.
Checklist
- I have installed dependencies via `poetry install` (see CleanRL's installation guideline).
- I have checked that there is no similar issue in the repo.
- I have checked the documentation site and found no relevant information in GitHub issues.
Current Behavior
Agent policies are being evaluated at a different and higher epsilon than the final training epsilon.
Expected Behavior
Agent policies should be evaluated with the final epsilon used during training.
Possible Solution
Modify all Q-learning agents so that the evaluation epsilon equals the final training epsilon.
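A minimal sketch of the fix, assuming a standard epsilon-greedy action selector (the names `end_e` and `eval_epsilon` are hypothetical, mirroring CleanRL-style hyperparameters; the actual evaluation utilities take epsilon as an argument):

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability `epsilon`, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical config mirroring the issue: the agent anneals epsilon to
# end_e during training, so evaluation should reuse end_e rather than a
# hard-coded 0.05.
end_e = 0.01           # final training epsilon (e.g. 0.01 for Atari)
eval_epsilon = end_e   # fix: evaluate at the same epsilon as training

action = epsilon_greedy_action([0.1, 0.9, 0.3], eval_epsilon)
```

With `eval_epsilon` tied to `end_e`, the evaluation policy matches the policy the agent actually converged to, instead of injecting extra random actions.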