It seems that the commutative variable is recalculated here:
First time:
|
r_coef = (reward_loss_grad * b_step_dir).sum(0, keepdim=True) # todo: compute r_coef: = g^T H^{-1} b |
Repeated :
|
r_coef = (reward_loss_grad * b_step_dir).sum(0, keepdim=True) # todo: compute r_coef: = g^T H^{-1} b |