Khrylx/PyTorch-RL

How are we using rewards in imitation learning?

SiddharthSingi opened this issue · 4 comments

Hi, these implementations are amazing, thank you for sharing them. I have a question about how, and more importantly why, we are using rewards in imitation learning.

rewards = torch.from_numpy(np.stack(batch.reward)).to(dtype).to(device)

In the paper it is mentioned that instead of using the rewards to improve the policy, we use the log of the discriminator value, as in the last line before the end of the for loop:
[Screenshots of the update equations from the paper]
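For readers who cannot see the screenshots: the steps being referred to are presumably the two updates of Algorithm 1 in the GAIL paper (Ho & Ermon, 2016), which read roughly as follows (the exact notation in the screenshots may differ). The discriminator parameters w are updated with the gradient

\hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_w \log D_w(s,a)\right] + \hat{\mathbb{E}}_{\tau_E}\!\left[\nabla_w \log\!\left(1 - D_w(s,a)\right)\right]

and the policy parameters \theta are updated with a KL-constrained (TRPO) step that treats \log D_w(s,a) as a cost to be minimized, i.e. with the gradient

\hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q(s,a)\right] - \lambda \nabla_\theta H(\pi_\theta), \qquad Q(\bar{s},\bar{a}) = \hat{\mathbb{E}}_{\tau_i}\!\left[\log D_w(s,a) \mid s_0=\bar{s},\, a_0=\bar{a}\right]

where \tau_i are trajectories of the current policy, \tau_E are expert trajectories, and H(\pi_\theta) is the causal entropy.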

As you can see above, the policy update uses the log of the discriminator. Could you please explain why this term is being used instead of the reward?

The reward is computed exactly as the log of the discriminator, as shown in this line:

return -math.log(discrim_net(state_action)[0].item())
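For anyone tracing the code, here is a minimal, self-contained sketch of how such a reward function can be wired up. The name expert_reward, the network architecture, and the dimensions are illustrative assumptions; only the -math.log(discrim_net(...)) expression mirrors the line quoted above.

import math
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for the repo's Discriminator network: it maps a
# concatenated (state, action) vector to a probability in (0, 1).
state_dim, action_dim = 3, 1
discrim_net = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

def expert_reward(state, action):
    # Concatenate state and action into the discriminator's input.
    state_action = torch.as_tensor(np.hstack([state, action]), dtype=torch.float32)
    with torch.no_grad():
        # D(s, a) is trained toward 1 on policy-generated pairs and 0 on
        # expert pairs, so -log(D) grows as the pair looks more expert-like.
        return -math.log(discrim_net(state_action)[0].item())

# Example: reward for a single (state, action) sample.
r = expert_reward(np.zeros(state_dim), np.zeros(action_dim))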

Thank you for your response. For anyone else wondering the same thing, please check this line as well:

reward = custom_reward(state, action)

The reward is computed exactly as the log of the discriminator, as shown in this line:

return -math.log(discrim_net(state_action)[0].item())

This is exactly how the reward should be calculated; however, everywhere it is written as the expectation of the derivative of the log term (see the images above). Can you please tell me why the 'minus' sign is removed in the expectation term?

The D in this code is effectively the negative of the D in the original paper.

PyTorch-RL/gail/gail_gym.py

Lines 125 to 126 in d94e147

discrim_loss = discrim_criterion(g_o, ones((states.shape[0], 1), device=device)) + \
discrim_criterion(e_o, zeros((expert_traj.shape[0], 1), device=device))

From the above two lines of code, you can see that the discriminator is trained to output 1 for the generated data g_o and 0 for the expert data e_o. The goal of the policy update is then to minimize the discriminator's output on generated data, i.e., to maximize the -log(D(g_o)) reward, which is what makes the training adversarial.
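To make this concrete, below is a minimal, self-contained sketch of one such adversarial round under the convention described above (generated data labeled 1, expert data labeled 0). The network sizes, optimizer, and tensor names gen_sa/exp_sa are illustrative assumptions; only g_o, e_o, and discrim_criterion mirror the repo's snippet.

import torch
import torch.nn as nn

# Illustrative sizes; the real code concatenates state and action features.
sa_dim, batch_size = 4, 32

# Discriminator D(s, a) -> probability that the pair came from the policy.
discrim_net = nn.Sequential(nn.Linear(sa_dim, 64), nn.Tanh(),
                            nn.Linear(64, 1), nn.Sigmoid())
discrim_criterion = nn.BCELoss()
optimizer_discrim = torch.optim.Adam(discrim_net.parameters(), lr=3e-4)

# Stand-ins for a batch of policy-generated and expert (state, action) pairs.
gen_sa = torch.randn(batch_size, sa_dim)   # generated by the current policy
exp_sa = torch.randn(batch_size, sa_dim)   # drawn from expert demonstrations

# Discriminator step: push D toward 1 on generated data and 0 on expert data.
g_o = discrim_net(gen_sa)
e_o = discrim_net(exp_sa)
discrim_loss = discrim_criterion(g_o, torch.ones(gen_sa.shape[0], 1)) + \
               discrim_criterion(e_o, torch.zeros(exp_sa.shape[0], 1))
optimizer_discrim.zero_grad()
discrim_loss.backward()
optimizer_discrim.step()

# Policy step (PPO/TRPO, not shown) then maximizes the surrogate reward
# -log(D(s, a)): it is large when D is close to 0, i.e. when a generated
# pair is mistaken for expert data, which is what makes the training adversarial.
with torch.no_grad():
    rewards = -torch.log(discrim_net(gen_sa))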