Mismatch in ratios calculation

Question

Opened this issue 5 months ago · 0 comments

In PPO, the ratios are computed as exp(logits - log probs). Although this does not prevent the agent from learning, the two terms should be the same type of quantity: either keep logits on both sides, or add a softmax layer to the actor network so that it outputs probabilities.
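To illustrate the mismatch, here is a minimal standalone sketch. It is not this repository's code: the `log_softmax` and `ppo_ratio` helpers and the numeric values are hypothetical, and it assumes a discrete action space where the actor outputs one raw logit per action.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def ppo_ratio(new_logits, old_log_prob, action):
    """Ratio pi_new(a|s) / pi_old(a|s), with both terms as log-probabilities."""
    new_log_prob = log_softmax(new_logits)[action]
    return math.exp(new_log_prob - old_log_prob)

# Hypothetical values for a single state with three actions.
old_logits = [1.0, 2.0, 0.5]
new_logits = [1.2, 1.9, 0.4]
action = 1

old_log_prob = log_softmax(old_logits)[action]

# Consistent: log-prob minus log-prob.
consistent = ppo_ratio(new_logits, old_log_prob, action)

# Mismatched: raw logit minus log-prob, as described in this issue.
mismatched = math.exp(new_logits[action] - old_log_prob)

print(consistent, mismatched)
```

In this sketch the mismatched value differs from the consistent one by the softmax normalizer of the new logits, so it is not a true probability ratio and does not equal 1 when the new and old policies are identical.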