awarebayes/RecNN

Questions about Topk REINFORCE

Closed this issue · 5 comments

Hello, thanks for sharing!
I have some questions about pi_beta_sample in models.py. You use this function in _select_action_with_TopK_correction, but it seems to sample only one item at a time?
I am also confused by Equation 6 in the original paper,
$$\sum_{\tau \sim \beta}\left[\sum_t \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}\, \frac{\partial \alpha(a_t \mid s_t)}{\partial \pi(a_t \mid s_t)}\, R_t(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
as we want to sample a set of top-k items, shouldn't it be the following instead?
$$\sum_{\tau \sim \beta}\left[\sum_t \sum_{i=1}^{K} \frac{\pi_\theta(a_{t,i} \mid s_t)}{\beta(a_{t,i} \mid s_t)}\, \frac{\partial \alpha(a_{t,i} \mid s_t)}{\partial \pi(a_{t,i} \mid s_t)}\, R_t(s_t, a_{t,i})\, \nabla_\theta \log \pi_\theta(a_{t,i} \mid s_t)\right]$$
Here a_{t,i} represents the i-th item at time t.
I would appreciate any comments on this question, since it has been bothering me for a long time.
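To make the question concrete, here is a rough sketch of the per-item update I have in mind (not code from this repo; pi_probs, beta_probs, log_pi, reward and K are placeholder names for the quantities at a single time step):

    import torch

    def topk_corrected_loss(pi_probs, beta_probs, log_pi, reward, K):
        # pi_probs, beta_probs, log_pi: tensors of shape (K,) holding pi_theta(a_{t,i}|s_t),
        # beta(a_{t,i}|s_t) and log pi_theta(a_{t,i}|s_t) for the K sampled items
        # reward: scalar R_t for this step
        importance = (pi_probs / beta_probs).detach()          # off-policy weight, treated as constant
        lambda_k = K * (1.0 - pi_probs.detach()) ** (K - 1)    # d(alpha)/d(pi) = K (1 - pi)^(K-1)
        # the extra sum over i = 1..K that I am asking about; gradients flow only through log_pi
        return -(importance * lambda_k * reward * log_pi).sum()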

I honestly don't know. I tried to make it the way the paper's authors suggested (with one little tweak, discussed in #7), although your version seems more logical. Have you tried to implement this? I can also test it to make sure it is doing the right thing. Recently I have also found that the algorithm really lacks a top-K normalizing term in prediction, so it often gets stuck recommending only one or a few items. Adding some diversity penalty or taking the top-k recommendations could therefore be an improvement (see the sketch below).

If you happen to implement this (optionally as a separate function), feel free to submit it here.
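As for what I mean by taking top-k recommendations at prediction time instead of a single argmax, here is a rough illustration (not what is in the repo right now; the function and parameter names are made up):

    import torch

    def recommend_topk(scores, k=10, temperature=1.0):
        # scores: (num_items,) raw item scores from the policy network
        # returning k items instead of a single argmax avoids always serving the same item;
        # a temperature > 1 flattens the distribution and adds some extra diversity
        probs = torch.softmax(scores / temperature, dim=-1)
        return torch.topk(probs, k).indices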

I didn't implement this paper, but the definition of an action in my research is similar to the one in this paper. More specifically, I also need to select a set of items according to the item scores produced by the neural network.
When selecting items, I use

        if deterministic:   # evaluation: take the K highest-scoring items
            w_p, w_idx = torch.topk(scores, K)
        else:               # training: sample K distinct items in proportion to their scores
            w_idx = torch.multinomial(scores, K)

Then I use torch.gather to get the corresponding log probability of each selected item. After that, I add these log probabilities up at each step and multiply by R_t when updating the net.
I found this paper because the example policy-gradient code I could find always chooses only one item per step, so I wonder whether my method is correct or not. I don't add any correction factor in my code, so my gradient is
$$\sum_t R_t \sum_{i=1}^{K} \nabla_\theta \log \pi_\theta(a_{t,i} \mid s_t)$$
a_{t, i} represents a selected item at time step t.
I tried to email the authors about my question, but so far there has been no response.
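For reference, a simplified sketch of how I compute the update (my own code with made-up names, not code from this repo):

    import torch

    def policy_gradient_loss(scores, w_idx, reward):
        # scores: (num_items,) item probabilities produced by the network (they sum to 1)
        # w_idx:  (K,) indices of the K items drawn with torch.multinomial
        # reward: scalar R_t for this step
        log_probs = torch.log(scores)                 # log pi(a | s_t) for every item
        selected = torch.gather(log_probs, 0, w_idx)  # log pi(a_{t,i} | s_t) for the K picks
        # no correction factor: just sum the K log-probabilities and weight by R_t
        return -(reward * selected.sum())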

Yes, they also didn't respond to the letter in which I notified them about my repo. Thanks for sharing; I will work on making the algorithm more stable and will make sure to try this approach.

OK, I will keep watching this repository. Please let me know if you have any new thoughts, and thanks for sharing too.

In case someone else might come back to this at some point:
I was wondering the same thing, and I implemented it for the scenario where only one action per slate can/will be clicked anyway, so when receiving feedback we know which item that feedback corresponds to.

I guess the authors did the same thing because this sounds like it:

(2) While the main policy head π_θ is trained using only items on the trajectory with non-zero reward³, the behavior policy β_θ′ is trained using all of the items on the trajectory to avoid introducing bias in the β estimate.

with footnote 3 saying:

We ignore them in the user state update as users are unlikely to notice them and, as a result, we assume the user state is not influenced by these actions.
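In code, the setup I describe above looks roughly like this (a sketch of my own implementation with made-up names, assuming exactly one clicked item per slate):

    import torch

    def slate_losses(log_pi, log_beta, slate_idx, clicked_pos, reward, correction):
        # log_pi, log_beta: (num_items,) log-probabilities from the pi and beta heads
        # slate_idx: (K,) indices of the K items that were shown
        # clicked_pos: position within the slate of the single clicked item
        # reward, correction: scalar reward and (off-policy / top-K) correction for that item
        clicked_item = slate_idx[clicked_pos]
        # main policy head: only the clicked (non-zero reward) item contributes
        pi_loss = -(correction * reward * log_pi[clicked_item])
        # behavior policy head: trained on all items shown on the trajectory
        beta_loss = -torch.gather(log_beta, 0, slate_idx).sum()
        return pi_loss, beta_loss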