maximization bias
mikelty opened this issue · 1 comment
Hello, I'm not sure whether this is an issue or not, but I've been looking at your implementation for half an hour and I think there may be a maximization bias in it. Specifically, you use the same set of experience to update both Q-tables, whereas the paper says that two independent Q-tables benefit training.
I've tested my idea on a similar code base, and its owner so far agrees with my view. I've also opened a Stack Overflow question here. Could you comment on this? I plan to test this implementation as well.
Thanks in advance.
Hi,
We indeed use the same data to update both of the Q-functions. I haven't tested splitting the data and using different subsets for the different Q's, but my guess is that it wouldn't make much difference in terms of maximization bias. My reasoning is that we evaluate the Q-functions (both for the TD target and the policy target) at actions that are not part of the data but are instead sampled from the current policy. For those actions, given a seen state, the Q-values are less correlated, since the Q's were never trained on those particular actions, and this reduces the maximization bias. We've observed that this can make a big difference in practice, especially in higher-dimensional tasks.
I hope this answers your question!
Tuomas