BM25 clarification
Closed this issue · 2 comments
Hi @taolei87 ,
Thanks for sharing these data! I would like to ask: on which text was the BM25 calculated?
The test.txt and dev.txt files contain the BM25 scores of the questions, computed by the Lucene search engine. However, I couldn't find anywhere whether the scores are based on the titles of the questions, the bodies, or both, and whether stopwords were removed for these scores. Could you please clarify?
Thanks in advance:)
hi @christinazavou ,
We used both the titles and the entire bodies as the input to Lucene. The text was pre-processed under Lucene's default settings. I asked my co-author(s), and here is a longer description of what was done:
I used the Python script (based on the Beautiful Soup library) that Rish shared with me. This script extracted the full texts of questions+bodies from the askubuntu XML dump, and then I used Lucene to index the extracted data.
I did not truncate anything at 100 words; I kept the full body texts. Rish and I used Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/) to parse the XML and remove the HTML tags. However, I did keep the text inside the "" blocks, as I noticed that keeping it slightly improves the BM25 baseline performance. I did not do any tokenization or any other kind of preprocessing before feeding this to the Lucene indexer.
many thanks @taolei87 !
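For readers wondering what the Lucene BM25 scores in test.txt and dev.txt correspond to numerically: Lucene's `BM25Similarity` uses k1 = 1.2 and b = 0.75 by default. A rough self-contained sketch of classic Okapi BM25 scoring over whitespace-tokenized title+body texts (Lucene's exact idf/length normalization differs in minor details, and the corpus here is a toy example):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Classic Okapi BM25 score of one document against a query.
    k1/b match Lucene's BM25Similarity defaults."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))    # Lucene-style idf
        f = tf[term]                                     # term frequency
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / (f + norm)
    return score

# toy corpus of whitespace-tokenized title+body texts
corpus = [
    "how to install vim on ubuntu".split(),
    "vim keybindings question".split(),
    "install nvidia drivers".split(),
]
query = "install vim".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)  # doc 0 matches both query terms and scores highest
```

This is only meant to illustrate how the scores in the data files are shaped; the actual numbers in test.txt and dev.txt come from Lucene itself.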