Problem Description
In the Q-learning implementations for Atari (DQN, C51, and QDAgger DQN, in both the JAX and PyTorch versions), the final epsilon value used during training (e.g. 0.01) differs from the epsilon value used during evaluation at the end (e.g. 0.05).
I believe this makes the Atari evaluations underestimate the agents' true performance, since the higher evaluation epsilon injects more random actions than the policy was trained to act with.
I don't think this affects the training curves, as we mostly compare episodic rewards rather than evaluation results, but we should fix it for users who compare the evaluation results.
This bug appears to have been introduced by copying code from the DQN agent, where 0.05 is the final epsilon.
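To illustrate the mismatch, here is a minimal sketch; the `linear_schedule` helper and the hyperparameter values are assumptions modeled on typical CleanRL-style code, not a verbatim quote. Training anneals epsilon linearly down to 0.01, while the end-of-training evaluation acts with a hardcoded 0.05:

```python
import random

def linear_schedule(start_e: float, end_e: float, duration: int, t: int) -> float:
    """Linearly anneal epsilon from start_e to end_e over `duration` steps."""
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)

def epsilon_greedy(q_values: list[float], epsilon: float) -> int:
    """Take a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# During training: epsilon ends at 0.01 (hypothetical hyperparameters).
TOTAL_TIMESTEPS = 10_000_000
train_epsilon = linear_schedule(1.0, 0.01, TOTAL_TIMESTEPS // 10, TOTAL_TIMESTEPS)
print(train_epsilon)  # 0.01

# During evaluation: a different, higher epsilon is hardcoded,
# apparently copied from the DQN agent's 0.05 value.
eval_epsilon = 0.05
```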
Checklist
I have installed dependencies via `poetry install` (see CleanRL's installation guideline).
Current Behavior
Agent policies are evaluated at a different and higher epsilon than the final epsilon used during training.
Expected Behavior
Agent policies should be evaluated with the final epsilon used during training.
Possible Solution
Modify all Q-learning agents so that the evaluation epsilon equals the final training epsilon, as sketched below.
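A minimal sketch of the fix, assuming a Gymnasium-style environment; the names `evaluate`, `model`, and `args.end_e` are illustrative, not CleanRL's exact API. Evaluation takes epsilon as an explicit argument, and the call site passes the final training epsilon instead of a hardcoded 0.05:

```python
import random

def evaluate(model, env, num_episodes: int, epsilon: float) -> list[float]:
    """Run epsilon-greedy evaluation episodes at the given epsilon."""
    episodic_returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, episodic_return = False, 0.0
        while not done:
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                q_values = model(obs)  # hypothetical forward pass returning Q-values
                action = int(max(range(len(q_values)), key=lambda a: q_values[a]))
            obs, reward, terminated, truncated, _ = env.step(action)
            episodic_return += float(reward)
            done = terminated or truncated
        episodic_returns.append(episodic_return)
    return episodic_returns

# Call site: reuse the schedule's final value (e.g. args.end_e == 0.01)
# instead of the hardcoded 0.05:
# returns = evaluate(model, env, num_episodes=10, epsilon=args.end_e)
```

Making epsilon a required parameter, rather than a defaulted constant, forces each agent's call site to state which epsilon it evaluates at, which should prevent this kind of copy-paste mismatch from recurring.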