rail-berkeley/softlearning

Using log_alpha in alpha loss

acohen13 opened this issue · 2 comments

Why does the objective for alpha use log_alpha instead of alpha?

log_alpha * tf.stop_gradient(log_pis + self._target_entropy))

Is this equivalent to the objective in the paper, which uses alpha, or am I missing something?

Thank you for making this code publicly available.
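
For reference, here's roughly how I picture the surrounding temperature update (just my own sketch; names like `target_entropy` and the `tf.reduce_mean` wrapper are guesses on my part, not necessarily the exact softlearning code):

```python
import tensorflow as tf

# Sketch only -- the actual implementation lives in softlearning's SAC code.
log_alpha = tf.Variable(0.0, dtype=tf.float32)  # temperature is stored in log space
target_entropy = -1.0                           # e.g. -action_dim for a continuous action space

def alpha_loss(log_pis):
    # log_pis: log-probabilities of actions sampled from the current policy.
    # stop_gradient keeps the policy term fixed, so only log_alpha is trained here.
    return -tf.reduce_mean(
        log_alpha * tf.stop_gradient(log_pis + target_entropy))

# Elsewhere, alpha = tf.exp(log_alpha) is what actually multiplies the entropy
# term in the policy and Q-target computations.
```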

Yep, minimizing the loss with log_alpha corresponds to minimizing the objective with alpha in the paper. Something like:

min_{alpha} E[-alpha * log_pi - alpha * H]
= min_{alpha} E[-alpha * (log_pi + H)]
= min_{log_alpha} E[-log_alpha * (log_pi + H)]

The reason we do this is pretty arbitrary. Log values generally tend to be a bit nicer to work with (e.g. for numerical stability), but I don't think using the raw value instead would make any difference in this case.
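
Here's a tiny toy check of that point (my own snippet, not the repo code): whether the multiplier in the loss is log_alpha or alpha = exp(log_alpha), the gradient with respect to the trainable log_alpha has the same sign, so the temperature gets pushed in the same direction.

```python
import tensorflow as tf

# Toy check (my own snippet, not the repo code). c stands in for
# stop_gradient(log_pis + target_entropy); c > 0 means entropy is below target.
c = tf.constant(0.5)
log_alpha = tf.Variable(0.0)

# (a) loss written with log_alpha as the multiplier, as in the code above.
with tf.GradientTape() as tape:
    loss_a = -log_alpha * c
grad_a = tape.gradient(loss_a, log_alpha)   # = -c

# (b) loss written with alpha = exp(log_alpha), as in the paper.
with tf.GradientTape() as tape:
    loss_b = -tf.exp(log_alpha) * c
grad_b = tape.gradient(loss_b, log_alpha)   # = -c * exp(log_alpha)

print(grad_a.numpy(), grad_b.numpy())
# The gradients differ only by the positive factor exp(log_alpha), so gradient
# descent pushes the temperature in the same direction in both cases.
```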

@hartikainen Thanks for the short math derivation. I think it shows that the minimum value of the problem is unchanged if we use log_alpha instead of alpha. In hindsight, this is rather intuitive, because alpha and log_alpha are both real numbers ranging from negative infinity to positive infinity.

However, there is still something that I didn't understand at first. When using alpha as the multiplier for the sample entropy in the target and policy losses, we need to take the exp of log_alpha to get alpha back. But this seems a little shaky, right, given that log_alpha and alpha are treated as interchangeable in the derivation you presented?

Here's a perhaps naive explanation that I came up with:

min E[-log_alpha * log_pi - log_alpha * H_targ]
= min log_alpha * (E[-log_pi] - H_targ)
= min log_alpha * (H_curr - H_targ)
  • If H_curr - H_targ > 0, minimizing the loss pushes log_alpha below 0. This translates to an alpha in (0, 1) (try taking the exp of log_alpha), which weakens the entropy-maximization behavior when updating the networks.
  • If H_curr - H_targ < 0, minimizing the loss pushes log_alpha above 0. This translates to an alpha greater than 1, which strengthens the entropy-maximization behavior when updating the networks. (A quick numerical check of both cases follows below.)
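
Here's that check (a toy snippet of my own, not from the repo; it holds H_curr fixed, which is not true during training, so it only illustrates the direction of the update):

```python
# Toy illustration of the two cases (my own snippet). H_curr is held fixed here,
# so this only shows which way gradient descent drives log_alpha.
def run_log_alpha_updates(entropy_gap, lr=0.1, steps=100):
    """Plain gradient descent on loss = log_alpha * (H_curr - H_targ)."""
    log_alpha = 0.0
    for _ in range(steps):
        grad = entropy_gap          # d/d(log_alpha) of log_alpha * entropy_gap
        log_alpha -= lr * grad
    return log_alpha

print(run_log_alpha_updates(+0.5))  # H_curr > H_targ: log_alpha driven negative, alpha in (0, 1)
print(run_log_alpha_updates(-0.5))  # H_curr < H_targ: log_alpha driven positive, alpha > 1
```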