memray/seq2seq-keyphrase-pytorch

predicting multiple keyphrases


Hi Rui,

This isn't directly related to your PyTorch implementation, but I read your paper and am interested in reproducing it. However, I am confused about how the predictions are made at test time.

In your paper you split the one-to-many relationship between documents and phrases into pairs:

doc1: phrase1
doc1: phrase2
doc1: phrase3

and use this to train a seq2seq task where we predict a phrase given a document.
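As a rough illustration of the pair construction described above, the one-to-many data can be flattened as follows. This is only a sketch; the function name `expand_to_pairs` and the dictionary layout are assumptions made for the example, not taken from the repository.

```python
from typing import Dict, List, Tuple


def expand_to_pairs(doc_to_phrases: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Turn each document with N phrases into N (document, phrase) training pairs."""
    pairs = []
    for doc, phrases in doc_to_phrases.items():
        for phrase in phrases:
            # Each pair is an independent seq2seq example: source=doc, target=phrase.
            pairs.append((doc, phrase))
    return pairs


# Hypothetical toy data mirroring the doc1/phrase1..3 example above.
data = {"doc1": ["phrase1", "phrase2", "phrase3"]}
pairs = expand_to_pairs(data)
print(pairs)
```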

I understand how we could directly use this to predict a single keyphrase given a document, but how do we predict multiple phrases at test time?

Concretely, once the model is trained, shouldn't it behave deterministically, so that model(doc1) always predicts phrase1?

thanks,
Austin

Hi Austin,

Good point! I used beam search as a surrogate to generate multiple phrases, returning the most probable phrases conditioned on the input text. The 5th and 6th slides might be helpful to you. During the search, each candidate word at a decoding step opens a new search branch, so we end up with many different phrases.

Thanks,
Rui
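To make the beam search idea concrete, here is a minimal sketch of how a deterministic model can still yield several ranked phrases. It is a simplified variant under stated assumptions, not the implementation in this repository: `toy_model` is a hypothetical hand-written lookup table standing in for the trained decoder, and finished hypotheses are simply collected rather than competing for beam slots.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"


def beam_search(
    next_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int,
    max_len: int,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return up to beam_size finished sequences, ranked by log probability."""
    beams = [((), 0.0)]  # (token prefix, accumulated log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # Every possible next word opens a new branch of the search.
            for token, p in next_probs(prefix).items():
                if p <= 0.0:
                    continue
                cand = (prefix + (token,), score + math.log(p))
                (finished if token == EOS else candidates).append(cand)
        # Keep only the best unfinished prefixes for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical stand-in for p(next word | document, prefix)."""
    table = {
        (): {"neural": 0.6, "keyphrase": 0.4},
        ("neural",): {"network": 0.7, EOS: 0.3},
        ("neural", "network"): {EOS: 1.0},
        ("keyphrase",): {"extraction": 0.75, EOS: 0.25},
        ("keyphrase", "extraction"): {EOS: 1.0},
    }
    return table[prefix]


results = beam_search(toy_model, beam_size=3, max_len=3)
phrases = [" ".join(tokens[:-1]) for tokens, _ in results]  # drop the EOS marker
print(phrases)
```

The model is the same on every call, but the ranked list of completed hypotheses gives several distinct phrases for one document.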

That makes sense! So the other keyphrases are just the 2nd through n-th most likely sequences as scored by beam search.

Just out of curiosity, have you tried setting up the learning objective by concatenating the keyphrases together and generating everything in one go?

Something like:

doc1 : {sos} phrase1 {sep} phrase2 {sep} phrase3 {eos}
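A small sketch of this target format, showing how the phrases could be joined for training and how a generated sequence could be split back into phrases. The helper names `build_target` and `split_prediction` are hypothetical, and the literal `{sos}`, `{sep}` and `{eos}` strings are used as marker tokens purely for illustration.

```python
from typing import List

SOS, SEP, EOS = "{sos}", "{sep}", "{eos}"


def build_target(phrases: List[str]) -> str:
    """Concatenate all keyphrases of one document into a single target sequence."""
    return f"{SOS} " + f" {SEP} ".join(phrases) + f" {EOS}"


def split_prediction(sequence: str) -> List[str]:
    """Recover the individual phrases from a generated sequence."""
    body = sequence.replace(SOS, "").replace(EOS, "")
    return [part.strip() for part in body.split(SEP) if part.strip()]


target = build_target(["phrase1", "phrase2", "phrase3"])
recovered = split_prediction(target)
print(target)
print(recovered)
```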

We are thinking about new methods to improve performance, and we are doing something similar to what you suggested. One tricky problem is that if we construct the output as one sequence, the decoder might be sensitive to the order of the phrases.

Perhaps that can be avoided with some data augmentation, for example by randomly shuffling the phrase order at training time.

Yes, exactly, that's the basic idea we are working on. We are also trying out some new mechanisms, but they are not done yet.
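One way this shuffling could look, as a sketch only: each time a document is sampled, its phrases are put in a fresh random order before being joined into the target. The function name `shuffled_target` and the marker strings are assumptions for the example; whether this actually removes the order sensitivity is an open question in this thread.

```python
import random
from typing import List


def shuffled_target(
    phrases: List[str],
    rng: random.Random,
    sos: str = "{sos}",
    sep: str = "{sep}",
    eos: str = "{eos}",
) -> str:
    """Build a concatenated target with the phrases in a random order."""
    order = list(phrases)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return f"{sos} " + f" {sep} ".join(order) + f" {eos}"


rng = random.Random(0)  # seeded only to make the example repeatable
gold = ["phrase1", "phrase2", "phrase3"]
# Simulate three epochs: the same document gets a new ordering each time.
targets = [shuffled_target(gold, rng) for _ in range(3)]
for t in targets:
    print(t)
```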

Cool! I'm planning to try it as well. I'm lazy, so I just wanted to make sure there wasn't some obvious domain-related reason you didn't set things up like that in the first place. Best of luck with your experiments, and thanks for the conversation.

No problem. It was nice discussing with you :D