/InvertedIndex

An inverted index implementation paired with a query processor, both based on the document collection "Passage Ranking Dataset".

Primary LanguageJava

Information Retrieval Project

The problem

The scope of this project is the implementation of an index structure based on the document collection "Passage ranking dataset" available on this page: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020. This solution allows to handle information retrieval in front of a massive scale of documents, starting from the design of the data structures needed, implementing scalable indexing, and going towards query processing.

Setup

Before running, the collection.tar.gz should be downloaded and added to the collection folder, in order for the program to run.

Snowball Stemmer Installation

The snowball stemmer should be installed manually due to the absence of support from Maven using the following command:

mvn install:install-file -Dfile=./resources/libstemmer.jar -Dpackaging=jar -DgroupId=org.tartarus -DartifactId=snowball -Dversion=1.0

The stemmer jar file is included in the resources folder.

How to use our solution

The project contains a ready-to-use jar file that can be used to test our solution.
The first parameter can be used to choose if the indexing or the query processing component should be used.

Indexing component

Indexing can be performed using proper compile flags while running the program.
If no flags are used indexing is performed by default using the binary format.
Is the index flag is specified, the second flag can determine the type of indexing to be performed.

java -jar information-retrieval-project.jar index .DAT
java -jar information-retrieval-project.jar index .TXT

The textual format will use the ASCII encoding for debugging.

Query processing component

The first flag can be set to query in order to start the query processor.

java -jar information-retrieval-project.jar query

Once launched, the query processing component will require an input query according to this format:

Input Format: [AND—OR] term1 ... termN

The top k documents according to BM25 will be given in output.

Stopwords removal and stemming

Stopwords removal and stemming are used by default.
This setting can be changed using the flags contained in the application.properties file.

Trec_eval test

For the trec_eval test the following command can be launched:

mvn clean test

Results will be written in the /collection/queries.results.txt file.