This project implements an index structure for the "Passage ranking dataset" document collection, available at https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020. The solution supports information retrieval over a very large document collection, covering the design of the required data structures, scalable indexing, and query processing.
Before running the program, download collection.tar.gz and place it in the collection folder.
The Snowball stemmer must be installed manually, because it is not available as a Maven dependency, using the following command:
mvn install:install-file -Dfile=./resources/libstemmer.jar -Dpackaging=jar -DgroupId=org.tartarus -DartifactId=snowball -Dversion=1.0
The stemmer jar file is included in the resources folder.
The project contains a ready-to-use jar file that can be used to test our solution.
The first parameter selects whether the indexing or the query processing component is run.
Indexing is configured through command-line flags passed when running the program.
If no flags are given, indexing is performed by default using the binary format.
If the index flag is specified, the second flag determines the index format:
java -jar information-retrieval-project.jar index .DAT
java -jar information-retrieval-project.jar index .TXT
The textual format uses ASCII encoding and is intended for debugging.
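The flag handling described above can be sketched as follows. This is only an illustration, not the project's actual entry point: the class name CliSketch, the Mode and Format enums, and the choice to reject unknown values with an exception are all assumptions.

```java
public class CliSketch {
    // Hypothetical types; the real project may model these differently.
    enum Mode { INDEX, QUERY }
    enum Format { DAT, TXT }

    static final class Options {
        final Mode mode;
        final Format format;

        Options(Mode mode, Format format) {
            this.mode = mode;
            this.format = format;
        }
    }

    static Options parse(String[] args) {
        // With no flags, indexing in the binary format is the default.
        Mode mode = Mode.INDEX;
        Format format = Format.DAT;

        if (args.length > 0) {
            if (args[0].equals("index")) {
                mode = Mode.INDEX;
            } else if (args[0].equals("query")) {
                mode = Mode.QUERY;
            } else {
                // Assumption: unknown components are rejected.
                throw new IllegalArgumentException("Unknown component: " + args[0]);
            }
        }

        // The second flag only matters when indexing.
        if (mode == Mode.INDEX && args.length > 1) {
            if (args[1].equals(".TXT")) {
                format = Format.TXT;
            } else if (args[1].equals(".DAT")) {
                format = Format.DAT;
            } else {
                // Assumption: unknown formats are rejected.
                throw new IllegalArgumentException("Unknown format: " + args[1]);
            }
        }
        return new Options(mode, format);
    }

    public static void main(String[] args) {
        Options options = parse(args);
        System.out.println(options.mode + " " + options.format);
    }
}
```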
The first flag can be set to query in order to start the query processor.
java -jar information-retrieval-project.jar query
Once launched, the query processing component expects an input query in this format:
Input Format: [AND|OR] term1 ... termN
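As an illustration of this input format, here is a minimal parsing sketch. The class QueryParserSketch and its types are hypothetical, and two behaviors are assumptions rather than documented facts: a query without a leading AND or OR is treated as OR, and the operator keyword must be written in upper case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class QueryParserSketch {
    enum Operator { AND, OR }

    static final class ParsedQuery {
        final Operator operator;
        final List<String> terms;

        ParsedQuery(Operator operator, List<String> terms) {
            this.operator = operator;
            this.terms = terms;
        }
    }

    static ParsedQuery parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("Empty query");
        }
        // Copy into an ArrayList so the leading operator token can be removed.
        List<String> tokens = new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));

        // Assumption: OR (disjunctive) is used when no operator is given.
        Operator operator = Operator.OR;
        if (tokens.get(0).equals("AND")) {
            operator = Operator.AND;
            tokens.remove(0);
        } else if (tokens.get(0).equals("OR")) {
            tokens.remove(0);
        }
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("Query has an operator but no terms");
        }
        return new ParsedQuery(operator, Collections.unmodifiableList(tokens));
    }

    public static void main(String[] args) {
        ParsedQuery q = parse("AND information retrieval");
        System.out.println(q.operator + " " + q.terms);
    }
}
```

In the real pipeline the terms would then go through stopword removal and stemming before lookup; that step is left out here.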
The top k documents according to BM25 are returned as output.
Stopword removal and stemming are used by default.
These settings can be changed using the flags contained in the application.properties file.
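For readers unfamiliar with BM25 ranking, the sketch below shows one common formulation together with a heap-based top-k selection. It is not taken from the project: the class Bm25Sketch, the parameter values k1 = 1.2 and b = 0.75, and the smoothed IDF variant ln(1 + (N - df + 0.5) / (df + 0.5)) are all assumptions, and the project may use a different variant.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class Bm25Sketch {
    // Assumed parameter values; commonly used defaults, not the project's settings.
    static final double K1 = 1.2;
    static final double B = 0.75;

    static final class ScoredDoc {
        final int docId;
        final double score;

        ScoredDoc(int docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Smoothed inverse document frequency (always positive).
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Contribution of one query term to the score of one document.
    static double termScore(int tf, int docLen, double avgDocLen, long numDocs, long docFreq) {
        if (tf == 0) {
            return 0.0;
        }
        double lengthNorm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(numDocs, docFreq) * tf * (K1 + 1.0) / (tf + lengthNorm);
    }

    // Keeps the k best documents in a min-heap, then returns them best first.
    static List<ScoredDoc> topK(List<ScoredDoc> scored, int k) {
        Comparator<ScoredDoc> byScore = Comparator.comparingDouble((ScoredDoc d) -> d.score);
        PriorityQueue<ScoredDoc> heap = new PriorityQueue<>(byScore);
        for (ScoredDoc doc : scored) {
            heap.offer(doc);
            if (heap.size() > k) {
                heap.poll(); // drop the current lowest score
            }
        }
        List<ScoredDoc> result = new ArrayList<>(heap);
        result.sort(byScore.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<ScoredDoc> scored = new ArrayList<>();
        scored.add(new ScoredDoc(1, termScore(1, 50, 60.0, 1000, 10)));
        scored.add(new ScoredDoc(2, termScore(3, 50, 60.0, 1000, 10)));
        scored.add(new ScoredDoc(3, termScore(2, 50, 60.0, 1000, 10)));
        for (ScoredDoc doc : topK(scored, 2)) {
            System.out.println(doc.docId + " " + doc.score);
        }
    }
}
```

The min-heap keeps memory bounded by k regardless of how many documents are scored, which matters when the posting lists are long.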
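A minimal sketch of how such flags could be read with java.util.Properties is shown below. The key names "stemming" and "stopwords_removal" are hypothetical placeholders, since the actual keys are defined in the project's application.properties file; the class name PreprocessingConfigSketch is likewise an assumption. Both options default to enabled, matching the default behavior described above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PreprocessingConfigSketch {
    final boolean stemming;
    final boolean stopwordRemoval;

    PreprocessingConfigSketch(boolean stemming, boolean stopwordRemoval) {
        this.stemming = stemming;
        this.stopwordRemoval = stopwordRemoval;
    }

    static PreprocessingConfigSketch from(Reader reader) {
        Properties properties = new Properties();
        try {
            properties.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Hypothetical key names; check application.properties for the real ones.
        boolean stemming =
                Boolean.parseBoolean(properties.getProperty("stemming", "true"));
        boolean stopwordRemoval =
                Boolean.parseBoolean(properties.getProperty("stopwords_removal", "true"));
        return new PreprocessingConfigSketch(stemming, stopwordRemoval);
    }

    public static void main(String[] args) {
        // In practice the reader would wrap the application.properties file.
        PreprocessingConfigSketch config = from(new StringReader("stemming=false\n"));
        System.out.println(config.stemming + " " + config.stopwordRemoval);
    }
}
```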
To run the trec_eval test, use the following command:
mvn clean test
Results are written to the /collection/queries.results.txt file.