Questions about Topk REINFORCE
wwwangzhch opened this issue · 5 comments
Hello, thanks for sharing!
I have some questions about pi_beta_sample in models.py. You use this function in _select_action_with_TopK_correction, but it seems to sample only one item at a time?
I am also confused by Equation 6 in the original paper: since we want to sample a set of top-k items, shouldn't the log-probability term be summed over the k sampled items, where a_{t, i} represents the i-th item at time t?
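For concreteness, here is roughly the difference I mean, as a minimal sketch with made-up tensor names (not the repo's actual code):

```python
import torch

# toy item probabilities for a single user state (hypothetical catalogue of 100 items)
scores = torch.softmax(torch.randn(1, 100), dim=-1)

# what pi_beta_sample appears to do: draw a single item
single_item = torch.multinomial(scores, 1)

# what I would expect for top-k REINFORCE: draw a set of k distinct items
k = 10
topk_items = torch.multinomial(scores, k)  # without replacement by default
```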
I would appreciate any comments on my question, since it has been bothering me for a long time.
I honestly don't know. I tried to implement it the way the paper's authors suggested (with one little tweak discussed in #7), although your version does seem more logical. Have you tried to implement this? I can also test it to make sure it does the right thing. I've also recently found that the algorithm really lacks a top-K normalizing term at prediction time, so it often gets stuck recommending only one or a few items. Adding some diversity penalty or taking the top-k recommendations could be an improvement.
If you happen to implement this (optionally in another function), feel free to submit a commit here.
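As a rough idea of what I mean by taking top-k recommendations at prediction time, a minimal sketch (hypothetical function, not what the repo currently does):

```python
import torch

def recommend_slate(pi_scores: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Return indices of the k highest-scoring items instead of a single argmax,
    which should at least avoid recommending the same item over and over."""
    _, slate = torch.topk(pi_scores, k)
    return slate
```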
I haven't implemented this paper, but the definition of an action in my research is similar: I also need to select a set of items according to the item scores produced by the neural network.
When selecting items, I use:
```python
if deterministic:  # testing: take the K highest-scoring items
    w_p, w_idx = torch.topk(scores, K)
else:  # training: sample K distinct items in proportion to their scores
    w_idx = torch.multinomial(scores, K)
```
Then I use torch.gather to get the corresponding log probability of each selected item. After that, I sum these log probabilities at each step and multiply by R_t when updating the net.
I found this paper because the policy-gradient example code I have seen always chooses only one item per step, so I wonder whether my method is correct or not. I don't add any correction factor in my code, so my gradient is roughly $\nabla_\theta \sum_t R_t \sum_{i=1}^{K} \log \pi_\theta(a_{t,i} \mid s_t)$, where a_{t, i} represents a selected item at time step t.
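Putting the whole thing together, the update I described is roughly this (a minimal sketch with hypothetical shapes, no correction factor, not taken from any repo):

```python
import torch

def reinforce_topk_loss(scores, K, R_t, deterministic=False):
    """scores: (batch, n_items) item probabilities from the policy network.
    R_t: (batch,) return for the step. Returns a scalar REINFORCE loss."""
    if deterministic:                                        # testing
        _, w_idx = torch.topk(scores, K)
    else:                                                    # training
        w_idx = torch.multinomial(scores, K)                 # K distinct items
    log_probs = torch.log(torch.gather(scores, 1, w_idx))    # log-prob of each selected item
    # sum the K log-probs per step and weight by the return
    return -(log_probs.sum(dim=1) * R_t).mean()
```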
I tried to email the authors about my question, but so far there has been no response.
Yes, they also didn't respond to my email notifying them about my repo. Thanks for sharing; I will work on making the algorithm more stable and make sure to try this approach.
OK, I will keep watching this repository. Please let me know if you have any new thoughts, and thanks for sharing too.
In case someone else comes back to this at some point:
I was wondering the same thing, and I implemented it for the scenario where only one action per slate can/will be clicked anyway, so when receiving feedback we know which item that feedback corresponds to.
I guess the authors did the same thing, because this passage sounds like it:
> (2) While the main policy head π_θ is trained using only items on the trajectory with non-zero reward³, the behavior policy β_θ′ is trained using all of the items on the trajectory to avoid introducing bias in the β estimate.
with footnote 3 saying:
> We ignore them in the user state update as users are unlikely to notice them and as a result, we assume the user state are not influenced by these actions
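To make that split concrete, this is how I read it, as a sketch with hypothetical tensor names (and omitting the off-policy / top-K correction factors):

```python
import torch
import torch.nn.functional as F

def policy_losses(pi_logits, beta_logits, slate_idx, clicked_idx, reward):
    """pi_logits, beta_logits: (n_items,) logits of the two heads for one state.
    slate_idx: (K,) indices of all items shown; clicked_idx: the one clicked item."""
    # main policy pi_theta: trained only on the item with non-zero reward
    pi_loss = -reward * F.log_softmax(pi_logits, dim=-1)[clicked_idx]
    # behaviour policy beta: trained on every item shown, to keep the beta estimate unbiased
    beta_loss = -F.log_softmax(beta_logits, dim=-1)[slate_idx].mean()
    return pi_loss, beta_loss
```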