geek-ai/irgan

Why is the reward used in the generator loss different from what the paper describes?

superzhangmch opened this issue · 8 comments

In the code, the reward used for the generator loss appears in several different forms:
reward = 2 * (tf.sigmoid(...) - 0.5)
reward = tf.sigmoid(tf.maximum(0.0, 1.0 - (pos_score - neg_score)))
reward = tf.log(tf.sigmoid(neg_score - pos_score))

But none of them matches what the paper says:
... the term **log(1+exp(f_φ(d, q_n))) acts as the reward** for the policy
p_θ(d|q_n, r) taking an action d in the environment q_n [38].
In order to reduce variance during the REINFORCE learning, we
also replace the reward term log(1+exp(f_φ(d, q_n))) by its advantage
function: log(1+exp(f_φ(d, q_n))) − E_{d~p_θ(d|q_n, r)}[log(1+exp(f_φ(d, q_n)))] ...

Why is that?
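For reference, the reward and advantage exactly as quoted from the paper would look roughly like the sketch below, written in the same TF 1.x style as the snippets above. This is only a sketch: `neg_score` is an assumed name for f_φ(d, q_n) of generator-sampled documents, and the expectation is approximated by the batch mean.

```python
import tensorflow as tf

def paper_reward(neg_score):
    # Reward from the paper: log(1 + exp(f_phi(d, q_n))).
    # softplus(x) == log(1 + exp(x)), computed in a numerically stable way.
    return tf.nn.softplus(neg_score)

def paper_advantage(neg_score):
    # Advantage = reward minus its expectation under the generator policy;
    # the expectation is approximated here by the mean over the sampled batch.
    reward = paper_reward(neg_score)
    return reward - tf.reduce_mean(reward)
```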

Ian09 commented

Same question. Why is the reward function in the code different from the one in the paper?

Which reward function was used to generate the results in the paper?

Also, do we have any KL divergence bounds on the SVM loss?

And why are we using weight regularization, when WGAN uses weight clipping or a gradient penalty along with batch normalization layers?

ijiti commented

+1. Which reward function is correct?

LantaoYu commented

In the paper, we provide the log(1+exp(logits)) formulation for generality, which in theory lets the framework reach the ideal Nash equilibrium through the training procedure. But in practice, optimal convergence to the equilibrium is notoriously hard (or almost impossible) to achieve, and we do need a number of tricks to ease the training process. For example, in the original GANs paper (2014), to avoid the saturation of log(1-D(G(z))), they turn to maximizing log(D(G(z))). Also, in WGANs, they need weight clipping or a gradient penalty as practical implementations to achieve the Lipschitz property.
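For concreteness, the trick from the original GAN paper mentioned here can be sketched as follows (TF 1.x style; `fake_logits`, the discriminator's raw scores for generated samples, is an illustrative name, not code from this repo):

```python
import tensorflow as tf

def generator_loss_saturating(fake_logits):
    # Minimize log(1 - D(G(z))): gradients vanish once the discriminator
    # confidently rejects the generated samples.
    return tf.reduce_mean(tf.log(1.0 - tf.sigmoid(fake_logits) + 1e-8))

def generator_loss_non_saturating(fake_logits):
    # Maximize log(D(G(z))) instead, i.e. minimize -log(D(G(z))):
    # same fixed point, but much stronger gradients early in training.
    return -tf.reduce_mean(tf.log(tf.sigmoid(fake_logits) + 1e-8))
```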
Thus, an alternative reward design, sigmoid(logits) with an advantage (baseline) term, could be used to achieve satisfactory training. We think this can be regarded as a kind of reward shaping, which is quite common in the RL literature. For example, when we want to train a simulated robot to run in an environment as long as possible, the most generic form of reward is 1 if the robot is running, 0 otherwise. But in practice, we need to design a set of reward functions considering angles, speed, height, etc. Frankly speaking, there is no guarantee that after reward shaping the training will stay consistent with the original objective, but sometimes we do need these tricks to make things work in practice. Therefore, there are various kinds of reward designs worth trying in IRGAN tasks. And of course, anyone could try the original reward as described in the paper, which is still a perfectly good reward implementation in some scenarios.
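A minimal sketch of what such a shaped reward with a baseline could look like for the generator update (TF 1.x style; `logits` for the discriminator score of a sampled document and `log_prob` for its log-probability under the generator are illustrative names, not the repo's exact API):

```python
import tensorflow as tf

def shaped_generator_loss(logits, log_prob):
    # Shaped reward in the spirit of 2 * (sigmoid(logits) - 0.5), bounded in (-1, 1).
    reward = 2.0 * (tf.sigmoid(logits) - 0.5)
    # Subtract a baseline (batch mean) to reduce the variance of the REINFORCE estimator.
    advantage = tf.stop_gradient(reward - tf.reduce_mean(reward))
    # REINFORCE: maximize E[advantage * log p_theta], i.e. minimize its negative.
    return -tf.reduce_mean(advantage * log_prob)
```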

arjenpdevries commented

Hi @LantaoYu, your answer is too broad to clarify my thinking. It suggests the experiments do not implement what is described in the paper(?).

Can you tell us precisely what reward function was used to generate the results presented in the paper?

Thank you,

Arjen

@arjenpdevries I think that is what he meant. It seems that in the implementation the four tasks used four different reward functions, which differ from the one presented in the SIGIR paper.
I did not find these four rewards in the paper; I only see them in the implementation.

Lantao said: "this could be regarded as a kind of reward shaping, which is quite common in the RL literature." Nevertheless, the original GANs paper discussed that they switched to a different reward, and also explained why, if I remember correctly.

We've added the explanation to the appendix of the arXiv version:
https://arxiv.org/pdf/1705.10513.pdf
Thanks.