Soft Q-learning derivation

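The paper Reinforcement Learning with Deep Energy-Based Policies adds an entropy term to the policy objective, defines the soft Q-function and the soft value function (soft V), states the soft Bellman equation and a policy improvement theorem, and proves that soft Q converges when the soft Bellman equation is applied iteratively.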
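As a rough illustration of the convergence result, the sketch below runs soft Bellman iteration on a small tabular MDP. It assumes discrete states and actions, so the integral over actions becomes a sum: V(s) = alpha * log sum_a exp(Q(s,a)/alpha) and Q(s,a) = r(s,a) + gamma * E[V(s')]. The function names, the array layout (`P` of shape `(S, A, S)`, `R` of shape `(S, A)`), and the default hyperparameters are illustrative assumptions, not taken from the paper or from proof.pdf.

```python
import numpy as np


def soft_value(Q, alpha):
    """Soft value V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    # Subtract the row maximum before exponentiating for numerical stability.
    m = Q.max(axis=1, keepdims=True)
    lse = np.log(np.exp((Q - m) / alpha).sum(axis=1, keepdims=True))
    return (m + alpha * lse).squeeze(1)


def soft_q_iteration(P, R, gamma=0.9, alpha=1.0, tol=1e-10, max_iters=10_000):
    """Apply the soft Bellman backup repeatedly until Q stops changing.

    P: transition probabilities, shape (S, A, S) (assumed layout).
    R: rewards, shape (S, A) (assumed layout).
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        V = soft_value(Q, alpha)
        # Q(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        Q_new = R + gamma * (P @ V)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q


def soft_policy(Q, alpha):
    """Energy-based policy pi(a | s) = exp((Q(s, a) - V(s)) / alpha)."""
    V = soft_value(Q, alpha)
    return np.exp((Q - V[:, None]) / alpha)
```

With gamma below 1 the backup is a contraction in the max norm, so the loop is expected to terminate well before `max_iters`. The resulting `soft_policy` rows sum to 1 by construction, since V(s) is the log-normalizer of Q(s, ·)/alpha.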

proof.pdf works through the derivations of the results above.