toshikwa/gail-airl-ppo.pytorch

Potential bug during training?


Is there a reason you calculate the reward the way you do on line 69?

rewards = self.disc.calculate_reward(

My models were able to learn after I changed that line to

        with torch.no_grad():
            rewards = self.disc.g(states)

This gives the unshaped rewards, i.e. the output of the reward approximator g(s) without the shaping term h.
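
For context, here is a minimal sketch of the two reward variants being compared. It is not the repo's actual code: the names AIRLDiscrimSketch, g_net, h_net, shaped_reward and unshaped_reward are made up here, and it only assumes an AIRL-style discriminator f(s, s') = g(s) + gamma * h(s') - h(s) with D = sigmoid(f - log_pi).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AIRLDiscrimSketch(nn.Module):
        """Sketch of an AIRL-style discriminator: f(s, s') = g(s) + gamma*h(s') - h(s)."""

        def __init__(self, state_dim, gamma=0.99, hidden=64):
            super().__init__()
            self.gamma = gamma
            # g approximates the (unshaped) reward, h acts as a shaping potential.
            self.g_net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            self.h_net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def f(self, states, dones, next_states):
            return (self.g_net(states)
                    + self.gamma * (1 - dones) * self.h_net(next_states)
                    - self.h_net(states))

        def forward(self, states, dones, log_pis, next_states):
            # Logits of D = sigmoid(f - log_pi).
            return self.f(states, dones, next_states) - log_pis

        def shaped_reward(self, states, dones, log_pis, next_states):
            # -logsigmoid(-logits) = -log(1 - D): the shaped reward quoted above.
            with torch.no_grad():
                logits = self.forward(states, dones, log_pis, next_states)
                return -F.logsigmoid(-logits)

        def unshaped_reward(self, states):
            # g(s) alone, i.e. the proposed change.
            with torch.no_grad():
                return self.g_net(states)

In the AIRL formulation, g is intended to recover the reward while h is a potential-based shaping term, which is why g(s) alone is described as the unshaped reward.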

Did that work out for you? I found that my actor loss was unable to converge.

Yes, it did, although I was running it on environments with discrete state and action spaces. Which env are you using?

@liubaoryol It's great to hear that you got it working with a discrete action space! Could you please share your code? I think it would be valuable, as multiple people here have already asked about discrete action support. Thanks in advance.

Of course! Let me clean it up and I'll share it next week :)

I'm interested to know about the implementation for discrete action support too. :)

reward = -logsigmoid(-logits) = -log[1 - sigmoid(logits)] = -log(1 - D), which corresponds to the generator objective of minimizing log(1 - D).
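
A quick numerical check of that identity (a standalone snippet, not part of the repo):

    import torch
    import torch.nn.functional as F

    # Verify the identity -logsigmoid(-x) == -log(1 - sigmoid(x)) numerically.
    logits = torch.randn(1000)
    lhs = -F.logsigmoid(-logits)
    rhs = -torch.log(1.0 - torch.sigmoid(logits))
    print(torch.allclose(lhs, rhs, atol=1e-5))  # True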