A little question about SAC
zhiyiZeng opened this issue · 2 comments
Hello tsmatz, sorry to bother you. I'm reading the wonderful SAC tutorial code, but there is something confusing me.
In the function optimize_phi, there is a line of code as follows:
dot_product_t = tf.reduce_sum(mul_t, axis=-1, keepdims=True)
I don't understand why it uses sum rather than max, which is what the Bellman equation uses. Is there any particular reason behind it?
I really really appreciate any help. Thanks in advance~
Hello, zhiyiZeng-san.
This is because of the dot product operation.
As I have mentioned in this notebook, in our categorical policy, both
$\min_{i=1,2} Q_{\phi_i^{\prime}}(s_{t+1},a^*_{t+1}) + \alpha H(P(\cdot | \pi_\theta(s_{t+1})))$
On the contrary, reward_t and dones_t are both scalar values, so we should convert the 2-dimensional (one-hot) representations to scalar values before the operation.
For instance, if the probability of
That is, the conversion to scalar values is a dot product operation, which is why it needs tf.reduce_sum(..., axis=-1, keepdims=True).
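As a rough illustration of why the sum appears here, the sketch below reproduces the "multiply elementwise, then sum over the last axis" pattern in plain Python lists. The helper name `dot_last_axis` and all numeric values are hypothetical and not taken from the notebook; it only assumes that mul_t is an elementwise product of two per-action vectors, so that summing over the last axis yields one scalar per batch row.

```python
def dot_last_axis(a, b):
    """Row-wise dot product for two batches of vectors.

    Plain-Python analogue of multiplying elementwise and then calling
    reduce_sum(..., axis=-1, keepdims=True): each row collapses to a
    single value, kept inside a length-1 list so the batch shape is (N, 1).
    """
    return [[sum(x * y for x, y in zip(row_a, row_b))]
            for row_a, row_b in zip(a, b)]


# Hypothetical batch: 2 states, 2 actions per state.
q_values = [[1.5, -0.5],
            [0.25, 2.0]]

# One-hot vectors for the action in each row (illustrative values).
one_hot_actions = [[0.0, 1.0],
                   [0.0, 1.0]]

# Dot product with a one-hot vector picks out the value of that action.
selected_q = dot_last_axis(q_values, one_hot_actions)

# Dot product with a probability vector gives the expected value instead.
action_probs = [[0.5, 0.5],
                [0.25, 0.75]]
expected_q = dot_last_axis(q_values, action_probs)

print(selected_q)  # [[-0.5], [2.0]]
print(expected_q)  # [[0.5], [1.5625]]
```

In other words, the sum in this pattern is not a substitute for a max over actions; it only collapses a per-action vector into one scalar per row, so the result has the same shape as scalar quantities such as reward_t and dones_t.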
(P.S. Sorry, but I have found a misspelling in a math expression in the original notebook,
I see! Thank you for your quick response. 😀