xiaochus/Deep-Reinforcement-Learning-Practice

issue about the generation of action

Closed this issue · 3 comments

I have a question about how the action is generated. In your code, the action is computed as follows:
action = mu + np.sqrt(sigma) * epsilon
Do mu and sigma denote the mean and stddev of the normal distribution over actions?
In your code, though, they seem to represent action and td_error respectively, so I'm confused about these two parameters.
I've seen this pattern in a lot of other code as well. Could you explain this piece of code when you have time?
Writing and reading English is tiring, so feel free to reply in Chinese. Thank you!

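@PacificBase They do represent the mean and stddev, not action and td_error. The action is obtained by sampling from a distribution, and td_error is obtained from two successive computations; neither of these is produced in a single forward pass of the actor. action and td_error are passed in as y_true and are only used in the loss computation.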
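A minimal sketch of the sampling step described above, not the repository's actual code. It assumes the quantity the thread calls sigma is a variance, so that `np.sqrt` yields the standard deviation; the thread does not confirm this, and if sigma were already a standard deviation the square root would not be needed. The function name `sample_action` and the numeric values are hypothetical.

```python
import numpy as np

def sample_action(mu, var, rng, size=None):
    """Draw actions from N(mu, var) as mu + sqrt(var) * epsilon.

    `var` is assumed to be a variance (an assumption, see the note above).
    """
    # epsilon is standard normal noise: this is the random sampling part
    epsilon = rng.standard_normal(size)
    return mu + np.sqrt(var) * epsilon

rng = np.random.default_rng(0)
# Hypothetical actor outputs: mean 0.5, variance 0.04 (stddev 0.2)
samples = sample_action(0.5, 0.04, rng, size=10000)
sample_mean = samples.mean()
sample_std = samples.std()
```

With enough draws, `sample_mean` and `sample_std` should land close to the mean and standard deviation passed in, which is a quick way to check how the two actor outputs are being interpreted.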

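So in other words, action = mu + np.sqrt(sigma) * epsilon is the mean plus a random epsilon multiplied by the stddev, and the epsilon is what represents the random sampling.
One more question: when computing the log probability of the action, why do you also multiply the original pdf by an epsilon?
Sorry for the trouble; this is my first time writing this kind of code myself.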

@PacificBase To prevent log(0).
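A small sketch of why a guard against log(0) matters. It assumes the epsilon is a small constant added to the pdf inside the logarithm, which is one common form; the thread does not show the exact expression. Note that this epsilon is unrelated to the sampling noise of the same name. The function name `gaussian_log_prob` and the default `eps=1e-10` are hypothetical.

```python
import numpy as np

def gaussian_log_prob(action, mu, var, eps=1e-10):
    """Log of the Gaussian pdf at `action`, with a small constant for stability.

    `var` is assumed to be a variance; `eps` is a hypothetical default.
    """
    pdf = np.exp(-0.5 * (action - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    # Without eps, a pdf that underflows to 0.0 would make the log -inf
    return np.log(pdf + eps)

# An action far from the mean: the pdf underflows to exactly 0.0
far = gaussian_log_prob(100.0, 0.0, 0.01)
# An action at the mean of a standard normal: eps barely changes the result
near = gaussian_log_prob(0.0, 0.0, 1.0)
```

When the pdf underflows, the result is bounded below by `log(eps)` instead of becoming `-inf`, so the loss and its gradients stay finite. An alternative would be to compute the log density analytically, which avoids the underflow without any constant.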