horoiwa/deep_reinforcement_learning_gallery

Bug: r2d2/atari multistep-return


TQ = rewards + self.gamma * (1 - dones) * next_maxQ # (unroll_len, batch_size)

The discount factor applied to next_maxQ should be gamma ** n_step.
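
For reference, a minimal sketch of the corrected n-step target, assuming `rewards` already holds the n-step discounted reward sum; the function name `nstep_td_target` and the explicit `n_step` argument are illustrative and not taken from the repository:

```python
import numpy as np

def nstep_td_target(rewards, dones, next_maxQ, gamma, n_step):
    """N-step TD target: R_t^(n) + gamma**n_step * (1 - done) * max_a Q(s_{t+n_step}, a).

    rewards   -- (unroll_len, batch_size) n-step discounted reward sums,
                 r_t + gamma*r_{t+1} + ... + gamma**(n_step-1)*r_{t+n_step-1}
    dones     -- (unroll_len, batch_size) termination flags (1.0 if the episode ended)
    next_maxQ -- (unroll_len, batch_size) max_a Q(s_{t+n_step}, a) from the target network
    """
    # The bootstrap value sits n_step transitions ahead of s_t, so it must be
    # discounted by gamma ** n_step, not by a single factor of gamma.
    return rewards + (gamma ** n_step) * (1.0 - dones) * next_maxQ


# Example with shapes matching the comment in the original line:
unroll_len, batch_size, n_step, gamma = 5, 4, 3, 0.997
rewards = np.zeros((unroll_len, batch_size))
dones = np.zeros((unroll_len, batch_size))
next_maxQ = np.ones((unroll_len, batch_size))
TQ = nstep_td_target(rewards, dones, next_maxQ, gamma, n_step)
assert TQ.shape == (unroll_len, batch_size)
```

With gamma < 1 and rewards accumulated over n_step transitions, discounting next_maxQ by a single factor of gamma over-weights the bootstrap term in the target.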