seungeunrho/minimalRL

Wrong gradient flow in bias correction term of ACER?

wwiiiii opened this issue · 1 comments

loss2 = -correction_coeff * pi * torch.log(pi) * (q.detach()-v) # bias correction term

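According to the original paper, the gradient of the bias correction term is defined as below:
[image: gradient of the bias correction term, from the ACER paper]
Since pi only serves as the probability weight for computing the expectation there, it does not seem to be a target of optimization itself.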

Shouldn't we detach pi from the computational graph in the line above, so that the gradient flows only through torch.log(pi)?
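
To make the difference concrete, here is a minimal sketch with a toy softmax policy over three actions. All values (`logits`, `q`, `v`, `correction_coeff`) are made-up stand-ins rather than anything from the repo, and `grad_of` is a hypothetical helper. It compares the gradient of the line as written, the gradient with `pi.detach()`, and a hand-computed expectation of `grad log pi * (q - v)` weighted by `pi`.

```python
import torch

# Hypothetical toy setup: a single state with three actions.
logits = torch.zeros(3, requires_grad=True)       # policy parameters (uniform pi)
q = torch.tensor([1.0, 2.0, 4.0])                 # stand-in for q.detach()
v = torch.tensor(2.0)                             # stand-in for v (treated as constant here)
correction_coeff = torch.tensor([0.5, 0.5, 0.5])  # stand-in for the truncation coefficient


def grad_of(loss_fn):
    """Gradient of the summed loss w.r.t. logits (illustrative helper)."""
    pi = torch.softmax(logits, dim=0)
    loss = loss_fn(pi).sum()
    (g,) = torch.autograd.grad(loss, logits)
    return g


# Line as written in the issue: gradient also flows through the leading pi factor.
g_orig = grad_of(lambda pi: -correction_coeff * pi * torch.log(pi) * (q - v))

# Proposed change: pi acts only as the expectation weight.
g_detached = grad_of(
    lambda pi: -correction_coeff * pi.detach() * torch.log(pi) * (q - v)
)

# Reference: -sum_a coeff(a) * pi(a) * d log pi(a) / d logits * (q(a) - v),
# using d log pi(a) / d logits_j = 1[a == j] - pi(j) for a softmax policy.
pi_const = torch.softmax(logits, dim=0).detach()
w = correction_coeff * pi_const * (q - v)
expected = -(w - pi_const * w.sum())

print("as written :", g_orig)
print("detached   :", g_detached)
print("reference  :", expected)
```

In this toy case only the detached version matches the reference; the line as written picks up an extra term from differentiating the leading `pi` factor.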

Wow, you're correct.
Thanks for such a sharp comment.
I updated the code.