horoiwa/deep_reinforcement_learning_gallery

Bug: r2d2/atari multistep-return


TQ = rewards + self.gamma * (1 - dones) * next_maxQ # (unroll_len, batch_size)

The discount factor applied to next_maxQ should be gamma ** n_step.
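
For reference, a minimal sketch of the corrected n-step target, assuming `rewards` already holds the n-step discounted reward sum; the function name `nstep_td_target` and the explicit `n_step` argument are illustrative and not taken from the repository:

```python
import numpy as np

def nstep_td_target(rewards, dones, next_maxQ, gamma, n_step):
    """N-step TD target: R_t^(n) + gamma**n_step * (1 - done) * max_a Q(s_{t+n_step}, a).

    rewards   -- (unroll_len, batch_size) n-step discounted reward sums,
                 r_t + gamma*r_{t+1} + ... + gamma**(n_step-1)*r_{t+n_step-1}
    dones     -- (unroll_len, batch_size) termination flags (1.0 if the episode ended)
    next_maxQ -- (unroll_len, batch_size) max_a Q(s_{t+n_step}, a) from the target network
    """
    # The bootstrap value sits n_step transitions ahead of s_t, so it must be
    # discounted by gamma ** n_step, not by a single factor of gamma.
    return rewards + (gamma ** n_step) * (1.0 - dones) * next_maxQ


# Example with shapes matching the comment in the original line:
unroll_len, batch_size, n_step, gamma = 5, 4, 3, 0.997
rewards = np.zeros((unroll_len, batch_size))
dones = np.zeros((unroll_len, batch_size))
next_maxQ = np.ones((unroll_len, batch_size))
TQ = nstep_td_target(rewards, dones, next_maxQ, gamma, n_step)
assert TQ.shape == (unroll_len, batch_size)
```

With gamma < 1 and rewards accumulated over n_step transitions, discounting next_maxQ by a single factor of gamma over-weights the bootstrap term in the target.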