How are we using rewards in imitation learning?
SiddharthSingi opened this issue · 4 comments
Hi, these implementations are amazing; thank you for sharing them. I have a question about how, and more importantly why, rewards are used in imitation learning.
Line 110 in d94e147
In the paper, it is mentioned that instead of using rewards to improve the policy, we use the log of the discriminator value, like so (the last line before the end of the for loop):
As you can see above, the policy update uses the log of the discriminator. Could you please explain why this term is used instead of the reward?
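For anyone reading without the screenshots, the update being referred to is (to the best of my reading) the policy step from Algorithm 1 of the GAIL paper, which uses the log of the discriminator in place of an environment reward:

```latex
% Policy gradient step from Algorithm 1 of the GAIL paper (Ho & Ermon, 2016):
% a KL-constrained (TRPO) step along
\hat{\mathbb{E}}_{\tau_i}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q(s,a) \right]
  - \lambda \nabla_\theta H(\pi_\theta),
\qquad
Q(\bar{s},\bar{a}) = \hat{\mathbb{E}}_{\tau_i}\!\left[ \log D_{w+1}(s,a) \,\middle|\, s_0=\bar{s},\, a_0=\bar{a} \right]
```

Note that the expectation contains `log D` with no minus sign, which is exactly the discrepancy discussed below.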
The reward is computed exactly as the log of the discriminator, as shown in this line:
Line 99 in d94e147
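In other words, the environment reward is never used; the discriminator's output is itself turned into a reward signal. Here is a minimal numpy sketch of that idea (the `discriminator` below is a dummy stand-in for a trained network, not the repo's actual code at line 99):

```python
import numpy as np

def discriminator(obs):
    # Dummy stand-in for a trained discriminator network: outputs the
    # probability that a sample was produced by the generator (policy).
    # Here it is just a sigmoid over an arbitrary score, for illustration.
    score = -0.5 * np.sum(obs ** 2, axis=-1)
    return 1.0 / (1.0 + np.exp(-score))

def gail_reward(obs, eps=1e-8):
    """Reward used for the policy update: r = -log(D(s, a)).

    With D trained to output 1 for generated samples and 0 for expert
    samples, this reward grows as D's output shrinks, i.e. the policy
    is rewarded for producing samples that look like expert data to D.
    """
    d = discriminator(obs)
    return -np.log(d + eps)

batch = np.array([[0.1, 0.2], [1.0, -1.0]])
rewards = gail_reward(batch)
```

The reward is then fed to an ordinary policy-gradient update (TRPO/PPO in most implementations), so the RL machinery is unchanged; only the reward source differs.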
Thank you for your response. For anyone else wondering the same thing, please check this line as well:
Line 42 in d94e147
> The reward is computed exactly as the log of the discriminator, as shown in this line:
> Line 99 in d94e147
This is exactly how the reward should be calculated; however, everywhere else it is written as the expectation of the derivative of the log term (see the images above). Can you please tell me why the minus sign is removed in the expectation term?
> The reward is computed exactly as the log of the discriminator, as shown in this line:
> Line 99 in d94e147
> This is exactly how the reward should be calculated; however, everywhere else it is written as the expectation of the derivative of the log term (see the images above). Can you please tell me why the minus sign is removed in the expectation term?
The D in this code effectively plays the role of the minus D in the original paper.
Lines 125 to 126 in d94e147
From the above two lines of code, it can be seen that the discriminator's training goal is to output 1 for the generated data g_o and 0 for the expert data e_o. The goal of the policy update is then to minimize the discriminator's output on generated data, i.e., to maximize the reward -log(D(g_o)), which is what makes the training adversarial.
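To make the sign conventions concrete, here is a numpy sketch of the two objectives described above. The discriminator outputs are dummy values standing in for a real network, so this illustrates only the labels and signs, not the repo's actual code:

```python
import numpy as np

def bce(pred, target, eps=1e-8):
    # Binary cross-entropy: -mean(t*log(p) + (1-t)*log(1-p))
    return -np.mean(target * np.log(pred + eps)
                    + (1.0 - target) * np.log(1.0 - pred + eps))

# Discriminator outputs on a batch of generated (g_o) and expert (e_o)
# samples -- dummy probabilities in place of a sigmoid-output network.
d_g = np.array([0.6, 0.7, 0.55])   # D(g_o): trained toward 1
d_e = np.array([0.3, 0.2, 0.4])    # D(e_o): trained toward 0

# Discriminator loss: label generated data as 1, expert data as 0.
disc_loss = bce(d_g, np.ones_like(d_g)) + bce(d_e, np.zeros_like(d_e))

# Policy reward on generated data: -log(D(g_o)) is maximized when
# D(g_o) -> 0, i.e. when the policy fools the discriminator into
# classifying its samples as expert data.
policy_reward = -np.log(d_g + 1e-8)
```

The minus sign is therefore a matter of convention: the paper writes a cost (log D) to be minimized, while the code writes a reward (-log D) to be maximized, and the two are equivalent.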