tsmatz/reinforcement-learning-tutorials

Not sure how to understand KL distance in PPO

Closed this issue · 2 comments

Hi tsmatz~,

Thank you for your wonderful RL tutorials. Your tutorials are much easier to learn from and understand.

I'm not sure how to understand some of the code in the PPO tutorial below. Could you take a moment to look at it?

kl = tf.reduce_sum(
            p0 * (a0 - tf.math.log(z0) - a1 + tf.math.log(z1)), axis=1)

I looked into the equation for the KL divergence, but I don't understand the meaning of a0 - tf.math.log(z0). My best guess is that a0 - tf.math.log(z0) is $\log P_{\theta_{\text{old}}}(x)$ and a1 - tf.math.log(z1) is $\log P_{\theta_{\text{new}}}(x)$. But why is that?

I'd appreciate it very much! Thank you!

$ \mathrm{KL}(P \| Q) $
$ = -\sum_x P(x) \ln{\frac{Q(x)}{P(x)}} $
$ = -\sum_x P(x) (\ln Q(x) - \ln P(x)) $
$ = \sum_x P(x) (\ln P(x) - \ln Q(x)) $

As you say,

$ \ln P(x) $ = a0 - tf.math.log(z0)
$ \ln Q(x) $ = a1 - tf.math.log(z1)

The reason why $\ln P(x)$ = a0 - tf.math.log(z0) is that a0 (and also a1) is a logits input.
If I assume a 3-dimensional logits input a0 = (a0_0, a0_1, a0_2) here to simplify my explanation, the distribution P(x) (which is generated by the logits a0) will be a categorical distribution, in which each element has the softmax probability:

exp(a0_0) / (exp(a0_0) + exp(a0_1) + exp(a0_2))
exp(a0_1) / (exp(a0_0) + exp(a0_1) + exp(a0_2))
exp(a0_2) / (exp(a0_0) + exp(a0_1) + exp(a0_2))

(The reason we apply "exp" to the logits is to make the values positive.)

When we apply "log" to each probability (i.e., when we compute $\log P(x)$), this leads to:

a0_0 - log(exp(a0_0) + exp(a0_1) + exp(a0_2))
a0_1 - log(exp(a0_0) + exp(a0_1) + exp(a0_2))
a0_2 - log(exp(a0_0) + exp(a0_1) + exp(a0_2))
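To make this identity concrete, here is a minimal NumPy sketch (the logit values are made up just for illustration) checking that each a0_i - log(z0) equals the log of the corresponding softmax probability:

```python
import numpy as np

# Hypothetical 3-dimensional logits, just for illustration
a0 = np.array([1.0, 2.0, 0.5])

ea0 = np.exp(a0)          # exp(a0_0), exp(a0_1), exp(a0_2)
z0 = ea0.sum()            # exp(a0_0) + exp(a0_1) + exp(a0_2)
p0 = ea0 / z0             # softmax probabilities P(x)

# log P(x) computed directly vs. via "logit minus log of the normalizer"
print(np.allclose(np.log(p0), a0 - np.log(z0)))  # True
```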

As you can see, z0 = tf.reduce_sum(ea0, axis=1, keepdims=True) computes exp(a0_0) + exp(a0_1) + exp(a0_2) in the symbolic example above.
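As a final sanity check (written with NumPy instead of TensorFlow, and with hypothetical logit values), the whole KL line from the tutorial can be compared against the textbook definition of KL divergence:

```python
import numpy as np

# Made-up logits for the old policy (a0) and new policy (a1)
a0 = np.array([[1.0, 2.0, 0.5]])   # shape (batch, actions)
a1 = np.array([[0.3, 1.5, 1.0]])

ea0, ea1 = np.exp(a0), np.exp(a1)
z0 = ea0.sum(axis=1, keepdims=True)
z1 = ea1.sum(axis=1, keepdims=True)
p0, p1 = ea0 / z0, ea1 / z1        # softmax probabilities P(x) and Q(x)

# KL written with logits, as in the tutorial code
kl_logits = np.sum(p0 * (a0 - np.log(z0) - a1 + np.log(z1)), axis=1)

# KL from the definition: sum over x of P(x) * (ln P(x) - ln Q(x))
kl_direct = np.sum(p0 * (np.log(p0) - np.log(p1)), axis=1)

print(np.allclose(kl_logits, kl_direct))  # True
```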

Wow, I see! Thank you, tsmatz! I really, really appreciate your help. Thank you for your wonderful tutorials!