Mismatch in ratios calculation
itsMyrto commented
In PPO, the ratios are computed as exp(logits - log probs), which mixes raw logits with log-probabilities. Although this does not prevent the agent from learning, both terms should be of the same type: either keep logits for both, or add a softmax layer to the actor network so that both are log-probabilities.
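To illustrate the point, here is a minimal sketch in plain Python, assuming a discrete action space. The helper names `log_softmax` and `ppo_ratio` and the example logits are hypothetical and not taken from this repository. With identical old and new policies the ratio should be exactly 1, but the mixed form is not.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def ppo_ratio(new_logits, old_logits, action):
    # Both terms are log-probabilities of the same action,
    # so exp(difference) is pi_new(a|s) / pi_old(a|s).
    new_logp = log_softmax(new_logits)[action]
    old_logp = log_softmax(old_logits)[action]
    return math.exp(new_logp - old_logp)

# Hypothetical example: the policy has not changed between rollout and update.
old_logits = [2.0, 0.5, -1.0]
new_logits = [2.0, 0.5, -1.0]
action = 0

# Consistent types: log-prob minus log-prob.
consistent = ppo_ratio(new_logits, old_logits, action)

# Mixed types, as described in this issue: raw logit minus log-prob.
mismatched = math.exp(new_logits[action] - log_softmax(old_logits)[action])

print(consistent)   # ratio for an unchanged policy
print(mismatched)   # off by the softmax normalizer
```

In the mixed form the ratio is scaled by the softmax normalizer (the sum of exponentiated logits), so it is not centered on 1 and the clipping range no longer means what it is intended to mean.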