Not sure how to understand KL distance in PPO
Closed this issue · 2 comments
Hi tsmatz~,
Thank you for your wonderful RL tutorials. Your tutorials are much easier to learn from and understand.
I'm not sure how to understand some code in PPO below. Could you spend a second looking at it?
```python
kl = tf.reduce_sum(
    p0 * (a0 - tf.math.log(z0) - a1 + tf.math.log(z1)), axis=1)
```
I looked into the equation of KL divergence but don't understand the meaning of `a0 - tf.math.log(z0)`. My best guess is that `a0 - tf.math.log(z0)` is $\ln P(x)$ and `a1 - tf.math.log(z1)` is $\ln Q(x)$, but I'm not sure why.
I appreciate it very much! Thank you!
$ \mathrm{KL}(P \| Q) $
$ = -\sum_x{P(x) \ln{\frac{Q(x)}{P(x)}}} $
$ = -\sum_x{P(x) (\ln Q(x) - \ln P(x))} $
$ = \sum_x{P(x) (\ln P(x) - \ln Q(x))} $
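To sanity-check that the forms above agree, here is a small NumPy sketch (the distributions `p` and `q` are made-up example values):

```python
import numpy as np

# Two made-up categorical distributions over 3 outcomes
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

# Last form: sum_x P(x) * (ln P(x) - ln Q(x))
kl = np.sum(p * (np.log(p) - np.log(q)))

# First form: -sum_x P(x) * ln(Q(x) / P(x))
kl_first_form = -np.sum(p * np.log(q / p))
```

Both forms give the same non-negative value.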
As you say,
$ \ln P(x) $ = `a0 - tf.math.log(z0)`
$ \ln Q(x) $ = `a1 - tf.math.log(z1)`
The reason why is as follows.
When I assume 3-dimensional logits a0 = (a0_0, a0_1, a0_2) here to simplify my explanation, the distribution P(x) (which is generated by logits a0) will be the categorical distribution, in which each element has the softmax probability:
exp(a0_0)/(exp(a0_0)+exp(a0_1)+exp(a0_2))
exp(a0_1)/(exp(a0_0)+exp(a0_1)+exp(a0_2))
exp(a0_2)/(exp(a0_0)+exp(a0_1)+exp(a0_2))
(The reason why we apply "exp" to the logits is to make the values positive.)
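As a concrete NumPy illustration (the logit values here are made up), the three softmax probabilities can be computed like this:

```python
import numpy as np

a0 = np.array([1.0, 2.0, 0.5])  # made-up example logits (a0_0, a0_1, a0_2)
ea0 = np.exp(a0)                # exponentiate to get positive numbers
z0 = ea0.sum()                  # exp(a0_0) + exp(a0_1) + exp(a0_2)
p0 = ea0 / z0                   # softmax probabilities, one per element
```

The resulting `p0` is a valid categorical distribution: each entry is positive and all entries sum to 1.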
When we apply "log" to each probability (i.e., we create $\ln P(x)$), it will then lead to:
a0_0 - log(exp(a0_0)+exp(a0_1)+exp(a0_2))
a0_1 - log(exp(a0_0)+exp(a0_1)+exp(a0_2))
a0_2 - log(exp(a0_0)+exp(a0_1)+exp(a0_2))
As you can find, the part `z0 = tf.reduce_sum(ea0, axis=1, keepdims=True)` corresponds to exp(a0_0)+exp(a0_1)+exp(a0_2) in the above example.
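Putting it all together, here is a NumPy sketch of the `kl` line from the question (made-up logits, `np` in place of `tf`), cross-checked against the direct definition $\sum_x P(x)(\ln P(x) - \ln Q(x))$:

```python
import numpy as np

# Made-up batches of 3-dimensional logits for distributions P (a0) and Q (a1)
a0 = np.array([[1.0, 2.0, 0.5]])
a1 = np.array([[0.8, 1.5, 1.0]])

ea0, ea1 = np.exp(a0), np.exp(a1)
z0 = ea0.sum(axis=1, keepdims=True)  # softmax normalizer for a0
z1 = ea1.sum(axis=1, keepdims=True)  # softmax normalizer for a1
p0 = ea0 / z0                        # softmax probabilities of a0

# Same expression as the TensorFlow code in the question:
# log P(x) = a0 - log(z0), log Q(x) = a1 - log(z1)
kl = np.sum(p0 * (a0 - np.log(z0) - a1 + np.log(z1)), axis=1)

# Cross-check against the direct form sum(P * (log P - log Q))
p1 = ea1 / z1
kl_direct = np.sum(p0 * (np.log(p0) - np.log(p1)), axis=1)
```

The two results match, confirming that the code computes the KL divergence between the two softmax distributions.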
Wow, I see! Thank you, tsmatz! I really appreciate your help. Thank you for your wonderful tutorials!