How many results from the initial TF-IDF retrieval need to be retained?

Question

How many results from the initial TF-IDF retrieval need to be retained?

ditingdapeng opened this issue 4 years ago · 5 comments

Hello! I would like to ask how many TF-IDF results need to be kept at the beginning. Is the number fixed?

Thank you!

Answer 1 · 2021-01-07T03:11:50.000Z

Hi @ditingdapeng, thanks for your interest in our work!

Sorry, I'm not sure which issue (or email?) you are referring to... Could you give me more information about what you mean by how many TF-IDF results need to be kept at the beginning? Is it about the document filtering process at inference time, or the number of negative examples during training?

Answer 2 · 2021-01-07T07:57:53.000Z

Thank you for your reply. What I mean is this: in your paper, the first hop from the question to the relevant facts is computed with TF-IDF. When TF-IDF is used to rank the supporting documents, how many paragraphs are ultimately selected as the initial nodes for multi-hop retrieval? I hope I have expressed my question clearly.

Answer 3 · 2021-01-07T13:55:22.000Z

Thanks for the clarification!

For our best models, we set the initial retrieval number (F in the paper) to 500, 100, and 100 paragraphs for HotpotQA full wiki, SQuAD Open, and Natural Questions Open, respectively ("Implementation details" section in our paper).

Please see Section C.1 and Figure 5 in the Appendix for a detailed discussion of how the number of initially retrieved TF-IDF paragraphs affects performance.
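
For readers who want a concrete picture of what "keeping the top F paragraphs" means, here is a minimal sketch in plain Python. It is not the retriever used in the paper or this repository: the tokenizer, the smoothed IDF formula, the function name `top_f_paragraphs`, and the dictionary keys are all assumptions made for illustration. Only the F values (500, 100, 100) come from the reply above.

```python
import math
from collections import Counter

# F values quoted in the reply above; the keys are illustrative labels only.
INITIAL_RETRIEVAL_F = {
    "hotpotqa_fullwiki": 500,
    "squad_open": 100,
    "natural_questions_open": 100,
}


def tokenize(text):
    """Very rough whitespace tokenizer (an assumption, not the paper's)."""
    return text.lower().split()


def top_f_paragraphs(question, paragraphs, f):
    """Return the indices of the f paragraphs with the highest TF-IDF score.

    Each paragraph is scored by summing tf * idf over the distinct
    question terms it contains. Ties are broken by original order.
    """
    tokenized = [tokenize(p) for p in paragraphs]
    n = len(tokenized)

    # Document frequency: in how many paragraphs each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    question_terms = set(tokenize(question))
    scored = []
    for index, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(tf[t] * idf[t] for t in question_terms if t in tf)
        scored.append((score, index))

    # Highest score first; slicing keeps at most f entries.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:f]]
```

With a real corpus you would pass something like `f=INITIAL_RETRIEVAL_F["hotpotqa_fullwiki"]`; the paragraphs returned would then serve as the starting nodes for the multi-hop step described in the thread.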

Answer 4 · 2021-01-07T13:56:46.000Z

Thank you for your kind reply!

Answer 5 · 2021-01-07T16:24:54.000Z

You're welcome! Feel free to start another issue or reach me via email if you have follow-up questions.