tsmatz/reinforcement-learning-tutorials

A little question about SAC

zhiyiZeng opened this issue · 2 comments

Hello tsmatz, sorry to bother you. I'm looking at the wonderful tutorial code for SAC, but there is something confusing me.

In the function optimize_phi, there is the following line of code:

        dot_product_t = tf.reduce_sum(mul_t, axis=-1, keepdims=True)

I don't understand why it uses a sum instead of the max that appears in the Bellman equation. Is there any particular reason behind it?

I really appreciate any help. Thanks in advance~

Hello, zhiyiZeng-san.
This is because of the dot product operation.

As I mentioned in this notebook, in our categorical policy, both $\pi_\theta(s_{t+1})$ and $Q_{{\phi_i}^{\prime}}(s_{t+1})$ in the following expression are 2-dimensional vectors (one value per action), not scalar values.

$\min_{i=1,2} Q_{{\phi_i}^{\prime}}(s_{t+1},a^*_{t+1}) + \alpha H(P(\cdot \mid \pi_\theta(s_{t+1})))$
$= \pi_\theta(s_{t+1}) \cdot \left( \min_{i=1,2} Q_{{\phi_i}^{\prime}}(s_{t+1}) - \alpha \log \pi_\theta(s_{t+1}) \right)$

On the contrary, reward_t and dones_t are scalar values, so we should convert the 2-dimensional (per-action) representations into scalar values before combining them.

For instance, if the probabilities of $\pi$ are (0.6, 0.4) and the corresponding Q-values are (300, 200), then the expected value under the policy is 0.6 * 300 + 0.4 * 200 = 180 + 80 = 260.
That is, this conversion to a scalar value is a dot product operation, which is why it needs tf.reduce_sum(..., axis=-1, keepdims=True).
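For illustration, here is a minimal standalone sketch of that dot product (the names pi_t, q_min_t, and alpha are illustrative placeholders, not the exact variables in the notebook):

    import tensorflow as tf

    # Illustrative values only (not the notebook's actual tensors):
    # action probabilities pi_theta(s_{t+1}) and per-action min_i Q_{phi_i'}(s_{t+1})
    pi_t = tf.constant([[0.6, 0.4]])
    q_min_t = tf.constant([[300.0, 200.0]])
    alpha = 0.0  # entropy coefficient set to 0 here to reproduce the 260 example above

    # element-wise product, then sum over the action axis = per-state dot product
    mul_t = pi_t * (q_min_t - alpha * tf.math.log(pi_t))
    dot_product_t = tf.reduce_sum(mul_t, axis=-1, keepdims=True)
    print(dot_product_t.numpy())  # [[260.]] = 0.6 * 300 + 0.4 * 200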

(P.S. Sorry, I found a typo in a math expression in the original notebook, $Q_{1^{\prime}}$ -> $Q_{i^{\prime}}$, and I have now fixed it.)

I see! Thank you for your quick response. 😀