We are asked to implement several different retrieval methods.
Some of these retrieval methods are implementations of the basic retrieval models studied in class (e.g., TF-IDF, BM25, and language models with different smoothing methods).
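For illustration, here is a minimal sketch of BM25 scoring over a toy in-memory inverted index; the index layout, the parameter defaults, and all names are assumptions made for this sketch, not a prescribed design.

```python
import math

def bm25_score(query_terms, doc_id, index, doc_len, avg_doc_len, num_docs,
               k1=1.2, b=0.75):
    """BM25 score of one document for a query (illustrative sketch).

    index   -- assumed layout: {term: {doc_id: term frequency}}
    doc_len -- {doc_id: document length in tokens}
    """
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf == 0:
            continue
        df = len(postings)  # number of documents containing the term
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len[doc_id] / avg_doc_len))
        score += idf * tf_norm
    return score
```

Producing a run then amounts to scoring each candidate document for each query, sorting by score, and keeping the top 1000.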
Various tools are built on top of the Lemur Project toolkits, including search engines, browser toolbars, text analysis tools, and data resources that support research and development in information retrieval and text mining.
TODO: add requirements for this project.
While installing `pyserini`, the installation of `nmslib` might fail. Here's a workaround to install `nmslib` in a Python 3.11 environment.
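The steps below are one commonly reported fix, assuming the failure is the build error caused by the old pybind11 pinned by the `nmslib` release on PyPI; if your error message differs, this may not apply.

```sh
# Assumed workaround: build nmslib from its GitHub sources (which track a
# newer pybind11), then install pyserini afterwards.
pip install --upgrade pip setuptools wheel
pip install "git+https://github.com/nmslib/nmslib.git#subdirectory=python_bindings"
pip install pyserini
```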
(assuming you have these files)
- Document Corpus: `WT2g/`, a 2GB collection of Web documents. We will use this corpus to test the retrieval algorithms and run the experiments.
- Queries: `topics.401-450.txt`, a set of 50 TREC queries for the corpus in the standard TREC format, with a topic title, description, and narrative for each query. Documents from the corpus have been judged with respect to their relevance to these queries by NIST assessors.
- Relevance judgments: `qrels.401-450.txt` (the TREC-8 small web qrels, `qrels.trec8.small_web`)
Evaluation tools:
- `trec_eval.pl`: provides a number of statistics about how well the retrieval function that produced the results_file did on the corresponding queries.
- `ireval.jar`

To use `trec_eval.pl`, run the following command:
`perl trec_eval.pl [-q] qrel_file results_file`
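For example, with the files listed above and a hypothetical results file `bm25_results.txt` (in standard trec_eval, the optional `-q` flag prints per-query statistics in addition to the overall summary; this script is assumed to behave the same):

```sh
perl trec_eval.pl -q qrels.401-450.txt bm25_results.txt
```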
To reproduce the results from PyTerrier, run the following command:
`make pyterrier`
The results will be saved to `pyterrier_results.csv`.
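For reference, a run like this can be sketched directly in PyTerrier as below; the index path, result depth, and choice of weighting models are assumptions for the sketch, not a description of what the Makefile target actually executes.

```python
import pyterrier as pt

if not pt.started():
    pt.init()

# Build a Terrier index over the TREC-formatted WT2g files
# (paths are illustrative assumptions).
files = pt.io.find_files("WT2g/")
index_ref = pt.TRECCollectionIndexer("./wt2g_index").index(files)

# Three of the models mentioned above, as Terrier weighting models,
# each returning the top 1000 documents per query.
models = ["TF_IDF", "BM25", "DirichletLM"]
retrievers = [pt.BatchRetrieve(index_ref, wmodel=m, num_results=1000)
              for m in models]

topics = pt.io.read_topics("topics.401-450.txt", format="trec")
qrels = pt.io.read_qrels("qrels.401-450.txt")

results = pt.Experiment(retrievers, topics, qrels,
                        eval_metrics=["map"], names=models)
results.to_csv("pyterrier_results.csv", index=False)
```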
We need to run the set of queries against the WT2g collection, return a ranked list of documents (the top 1000) for each query in a particular format, and then evaluate the ranked lists. See `WSM Project 2.pdf` for the project report.
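The "particular format" is presumably the standard TREC run format that trec_eval consumes: one line per retrieved document, containing the topic number, the literal `Q0`, the document ID, the rank, the score, and a run tag. The lines below are made-up examples:

```
401 Q0 WT01-B01-1 1 15.8320 bm25_run
401 Q0 WT02-B11-9 2 15.1176 bm25_run
```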