Khrylx/PyTorch-RL

How are we using rewards in imitation learning?

SiddharthSingi opened this issue · 4 comments

Hi, these implementations are amazing, thank you for sharing them. I have a question about how, and more importantly why, we are using rewards in imitation learning.

rewards = torch.from_numpy(np.stack(batch.reward)).to(dtype).to(device)

In the paper it is mentioned that instead of using the rewards to improve the policy, we use the log of the discriminator value, as in the last line before the end of the for loop:
[Screenshots of the update equations from the paper]
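For readers who cannot see the screenshots: the steps being referred to are presumably the two updates of Algorithm 1 in the GAIL paper (Ho & Ermon, 2016), which read roughly as follows (the exact notation in the screenshots may differ). The discriminator parameters w are updated with the gradient

\hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_w \log D_w(s,a)\right] + \hat{\mathbb{E}}_{\tau_E}\!\left[\nabla_w \log\!\left(1 - D_w(s,a)\right)\right]

and the policy parameters \theta are updated with a KL-constrained (TRPO) step that treats \log D_w(s,a) as a cost to be minimized, i.e. with the gradient

\hat{\mathbb{E}}_{\tau_i}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q(s,a)\right] - \lambda \nabla_\theta H(\pi_\theta), \qquad Q(\bar{s},\bar{a}) = \hat{\mathbb{E}}_{\tau_i}\!\left[\log D_w(s,a) \mid s_0=\bar{s},\, a_0=\bar{a}\right]

where \tau_i are trajectories of the current policy, \tau_E are expert trajectories, and H(\pi_\theta) is the causal entropy.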

As you can see above, the policy update uses the log of the discriminator. Could you please explain why this term is being used instead of the reward?

The reward is computed exactly as the log of the discriminator, as shown in this line:

return -math.log(discrim_net(state_action)[0].item())
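For anyone tracing the code, here is a minimal, self-contained sketch of how such a reward function can be wired up. The name expert_reward, the network architecture, and the dimensions are illustrative assumptions; only the -math.log(discrim_net(...)) expression mirrors the line quoted above.

import math
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for the repo's Discriminator network: it maps a
# concatenated (state, action) vector to a probability in (0, 1).
state_dim, action_dim = 3, 1
discrim_net = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

def expert_reward(state, action):
    # Concatenate state and action into the discriminator's input.
    state_action = torch.as_tensor(np.hstack([state, action]), dtype=torch.float32)
    with torch.no_grad():
        # D(s, a) is trained toward 1 on policy-generated pairs and 0 on
        # expert pairs, so -log(D) grows as the pair looks more expert-like.
        return -math.log(discrim_net(state_action)[0].item())

# Example: reward for a single (state, action) sample.
r = expert_reward(np.zeros(state_dim), np.zeros(action_dim))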

Thank you for your response. For anyone else wondering the same thing, please check this line as well:

reward = custom_reward(state, action)

The reward is computed exactly as the log of the discriminator, as shown in this line:

return -math.log(discrim_net(state_action)[0].item())

This is exactly how the reward should be calculated; however, everywhere it is written as the expectation of the derivative of the log term (see the images above). Can you please tell me why the 'minus' sign is removed in the expectation term?

The D in this code is effectively the negative of the D in the original paper.

PyTorch-RL/gail/gail_gym.py

Lines 125 to 126 in d94e147

discrim_loss = discrim_criterion(g_o, ones((states.shape[0], 1), device=device)) + \
discrim_criterion(e_o, zeros((expert_traj.shape[0], 1), device=device))

From the above two lines of code, you can see that the discriminator is trained to output 1 for the generated data g_o and 0 for the expert data e_o. The goal of the policy update is then to minimize the discriminator's output on generated data, i.e., to maximize the -log(D(g_o)) reward, which is what makes the training adversarial.
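To make this concrete, below is a minimal, self-contained sketch of one such adversarial round under the convention described above (generated data labeled 1, expert data labeled 0). The network sizes, optimizer, and tensor names gen_sa/exp_sa are illustrative assumptions; only g_o, e_o, and discrim_criterion mirror the repo's snippet.

import torch
import torch.nn as nn

# Illustrative sizes; the real code concatenates state and action features.
sa_dim, batch_size = 4, 32

# Discriminator D(s, a) -> probability that the pair came from the policy.
discrim_net = nn.Sequential(nn.Linear(sa_dim, 64), nn.Tanh(),
                            nn.Linear(64, 1), nn.Sigmoid())
discrim_criterion = nn.BCELoss()
optimizer_discrim = torch.optim.Adam(discrim_net.parameters(), lr=3e-4)

# Stand-ins for a batch of policy-generated and expert (state, action) pairs.
gen_sa = torch.randn(batch_size, sa_dim)   # generated by the current policy
exp_sa = torch.randn(batch_size, sa_dim)   # drawn from expert demonstrations

# Discriminator step: push D toward 1 on generated data and 0 on expert data.
g_o = discrim_net(gen_sa)
e_o = discrim_net(exp_sa)
discrim_loss = discrim_criterion(g_o, torch.ones(gen_sa.shape[0], 1)) + \
               discrim_criterion(e_o, torch.zeros(exp_sa.shape[0], 1))
optimizer_discrim.zero_grad()
discrim_loss.backward()
optimizer_discrim.step()

# Policy step (PPO/TRPO, not shown) then maximizes the surrogate reward
# -log(D(s, a)): it is large when D is close to 0, i.e. when a generated
# pair is mistaken for expert data, which is what makes the training adversarial.
with torch.no_grad():
    rewards = -torch.log(discrim_net(gen_sa))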