Search engine based on an inverted index, developed by Francesca Pezzuti, Pietro Tempesti and Benedetta Tessa for the Multimedia Information Retrieval course at the University of Pisa during the academic year 2022/2023. The documentation can be found here
The project is composed of these main modules:
- CLI
- Common module
- Indexer
- QueryHandler
- PerformanceTest
The Indexer module performs the indexing of the document collection using the SPIMI and merging algorithms.
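The two indexing phases can be illustrated with a minimal in-memory sketch. It is not the project's implementation: the function names, the whitespace tokenizer and the `max_postings_per_block` limit are assumptions made for illustration, and a real SPIMI run would write each block to disk instead of keeping it in memory.

```python
from collections import defaultdict
import heapq


def spimi_blocks(docs, max_postings_per_block):
    """Yield sorted partial indexes (blocks) built from (doc_id, text) pairs.

    Assumes doc_ids arrive in increasing order. A block is flushed once it
    holds at least max_postings_per_block postings (checked after each document).
    """
    block = defaultdict(dict)  # term -> {doc_id: term frequency}
    postings = 0
    for doc_id, text in docs:
        for term in text.lower().split():  # naive tokenizer, illustration only
            if doc_id not in block[term]:
                postings += 1
            block[term][doc_id] = block[term].get(doc_id, 0) + 1
        if postings >= max_postings_per_block:
            yield sorted((t, sorted(p.items())) for t, p in block.items())
            block, postings = defaultdict(dict), 0
    if block:
        yield sorted((t, sorted(p.items())) for t, p in block.items())


def merge_blocks(blocks):
    """Merge term-sorted blocks into one index: term -> [(doc_id, tf), ...]."""
    index = {}
    # heapq.merge is stable, so postings of earlier blocks come first.
    for term, postings in heapq.merge(*blocks, key=lambda entry: entry[0]):
        index.setdefault(term, []).extend(postings)
    return index
```

For example, `merge_blocks(spimi_blocks([(1, "a b a"), (2, "b c")], 2))` builds one block per document and merges them into a single term-to-postings dictionary.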
The CLI module is responsible for presenting an interface to the user.
From the interface, the user can input a query and specify some flags as well as the scoring function. (S)he will then
be presented with the top
The QueryHandler module handles the queries received by the CLI module.
In particular, once a query is received, it is pre-processed and tokenized, then the handler retrieves the vocabulary
entries of the tokens and the posting lists and finally applies either DAAT or MaxScore in order to get a ranking of
the top
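As an illustration of document-at-a-time (DAAT) processing, the sketch below walks all posting lists in parallel and keeps the best results in a min-heap. The names `daat_top_k` and `score`, and the in-memory list representation, are assumptions for this example rather than the project's API.

```python
import heapq


def daat_top_k(posting_lists, score, k, conjunctive=False):
    """Rank documents document-at-a-time.

    posting_lists: one list of (doc_id, tf) per query term, sorted by doc_id.
    score(term_index, tf): partial score contributed by one posting.
    Returns up to k (score, doc_id) pairs, best first.
    """
    if k <= 0:
        return []
    cursors = [0] * len(posting_lists)
    heap = []  # min-heap holding the current top k as (score, doc_id)
    while True:
        current = [pl[c][0] for pl, c in zip(posting_lists, cursors) if c < len(pl)]
        # Stop when every list is exhausted, or, in conjunctive mode,
        # as soon as any list is exhausted.
        if not current or (conjunctive and len(current) < len(posting_lists)):
            break
        doc_id = min(current)
        total, matched = 0.0, 0
        for i, pl in enumerate(posting_lists):
            c = cursors[i]
            if c < len(pl) and pl[c][0] == doc_id:
                total += score(i, pl[c][1])
                matched += 1
                cursors[i] += 1
        if conjunctive and matched < len(posting_lists):
            continue  # the document misses at least one query term
        if len(heap) < k:
            heapq.heappush(heap, (total, doc_id))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, doc_id))
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))
```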
The Common module works as a library: it contains the core classes, data structures and functions needed by all the other modules.
The PerformanceTest module performs tests and writes the results in a format suitable for trec_eval.
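trec_eval reads run files with one result per line, in the form `query_id Q0 doc_id rank score run_tag`. A minimal sketch of producing such lines follows. The helper name `trec_run_lines` and the default run tag are hypothetical.

```python
def trec_run_lines(query_id, ranking, run_tag="run0"):
    """Format a ranking as trec_eval run lines.

    ranking: list of (doc_id, score) pairs, best first.
    Ranks start at 1; "Q0" is the constant second column of the format.
    """
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranking, start=1)
    ]
```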
The Indexer module can be compiled using the following optional flags:
- -cr : if specified, it enables compressed reading of the document collection from tar.gz
- -c : if specified, it enables index compression using Unary for frequencies and Variable Byte for docids
- -s : if specified, it enables stopword removal and stemming during document processing
- -d : if specified, it runs the algorithms in debug mode, which creates human-readable files of the data structures that can be useful for debugging purposes.
The choices made for the last three flags are stored and reused for query processing.
If no flags are specified, the indexing will work on the uncompressed document collection (a tsv file), the index won't be compressed, stopwords won't be removed, stemming won't be performed, and debug mode won't be activated.
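The two codecs named by the -c flag can be sketched as follows. Both coding conventions are assumptions, since several variants exist: here Variable Byte stores 7 payload bits per byte and sets the high bit on the last byte of each number, and Unary writes a value n >= 1 as n - 1 one-bits followed by a zero-bit. The project may use different conventions.

```python
def vbyte_encode(numbers):
    """Variable Byte: 7 payload bits per byte, high bit marks the final byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80  # the least-significant group terminates the number
        out.extend(reversed(chunk))
    return bytes(out)


def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:
            numbers.append(n)
            n = 0
    return numbers


def unary_encode(numbers):
    """Unary: n >= 1 becomes (n - 1) one-bits followed by a zero-bit."""
    bits = "".join("1" * (n - 1) + "0" for n in numbers)
    bits += "1" * (-len(bits) % 8)  # pad with 1s: an unterminated run decodes to nothing
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def unary_decode(data):
    numbers, run = [], 0
    for byte in data:
        for shift in range(7, -1, -1):
            if (byte >> shift) & 1:
                run += 1
            else:
                numbers.append(run + 1)
                run = 0
    return numbers
```

Unary is compact only for small values, which fits term frequencies, while Variable Byte handles the much larger range of docids at byte granularity.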
The Query Handler module can be compiled using the following optional flags:
- -maxscore : if specified, it enables MaxScore as dynamic pruning algorithm for query processing
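To show what MaxScore pruning does, here is a minimal disjunctive sketch. It assumes non-negative partial scores that are already computed and stored in the postings, and it takes each list's upper bound as the maximum of those scores. The function name and data layout are hypothetical, not the project's.

```python
import heapq
from bisect import bisect_left


def max_score_top_k(posting_lists, k):
    """posting_lists: per-term lists of (doc_id, partial_score), sorted by doc_id.

    Returns up to k (score, doc_id) pairs, best first.
    """
    if k <= 0:
        return []
    # Order lists by increasing upper bound; bounds[i] is the cumulative bound of lists 0..i.
    lists = sorted((pl for pl in posting_lists if pl), key=lambda pl: max(s for _, s in pl))
    n = len(lists)
    bounds, acc = [], 0.0
    for pl in lists:
        acc += max(s for _, s in pl)
        bounds.append(acc)
    cursors = [0] * n
    heap, threshold, pivot = [], 0.0, 0  # lists[pivot:] are the essential lists
    while pivot < n:
        candidates = [lists[i][cursors[i]][0] for i in range(pivot, n) if cursors[i] < len(lists[i])]
        if not candidates:
            break
        doc_id = min(candidates)
        total = 0.0
        for i in range(pivot, n):  # score the essential lists
            c = cursors[i]
            if c < len(lists[i]) and lists[i][c][0] == doc_id:
                total += lists[i][c][1]
                cursors[i] += 1
        for i in range(pivot - 1, -1, -1):  # probe the non-essential lists
            if len(heap) == k and total + bounds[i] <= threshold:
                break  # the document cannot enter the top k
            # (doc_id,) sorts before (doc_id, score), so this seeks the first posting >= doc_id.
            cursors[i] = bisect_left(lists[i], (doc_id,), cursors[i])
            c = cursors[i]
            if c < len(lists[i]) and lists[i][c][0] == doc_id:
                total += lists[i][c][1]
        if len(heap) < k:
            heapq.heappush(heap, (total, doc_id))
        elif total > threshold:
            heapq.heapreplace(heap, (total, doc_id))
        if len(heap) == k:
            threshold = heap[0][0]
            while pivot < n and bounds[pivot] <= threshold:
                pivot += 1  # this list alone can no longer produce a top-k document
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))
```

The pruning comes from the pivot: documents that appear only in non-essential lists are never visited, because their best possible score cannot exceed the current threshold.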
There are no compile flags for this module.
Using the CLI, the user can write a query using either conjunctive mode or disjunctive mode:
- -d: it enables conjunctive mode
- -c: it enables disjunctive mode
The query must be submitted to the system using the following format:
- "query terms [-d | -c]"
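A minimal sketch of splitting a submitted line into query terms and the optional trailing flag. The function name `parse_query` is hypothetical, and returning `None` when no flag is given is an assumption, since the default mode is not stated here.

```python
def parse_query(line):
    """Split 'query terms [-d | -c]' into (terms, flag).

    flag is "-d", "-c", or None when the line ends with neither.
    """
    tokens = line.split()
    flag = None
    if tokens and tokens[-1] in ("-d", "-c"):
        flag = tokens.pop()
    return tokens, flag
```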
After submitting the query, the user is asked whether to use TFIDF or BM25 as the scoring function. The user must specify either:
- "tfidf": it enables TFIDF as scoring function
- "bm25": it enables BM25 as scoring function
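For reference, the two scoring functions can be sketched in a common textbook form. The exact variants, the logarithm base and the BM25 parameters `k1` and `b` used by the project are not stated here, so the formulas and default values below are assumptions.

```python
import math


def tfidf(tf, df, n_docs):
    """TFIDF partial score: (1 + log10 tf) * log10(N / df)."""
    if tf <= 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


def bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """BM25 partial score with document-length normalization.

    k1 controls term-frequency saturation, b the strength of length normalization.
    """
    if tf <= 0:
        return 0.0
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len) + tf
    return tf / norm * math.log10(n_docs / df)
```

Unlike TFIDF, the BM25 term-frequency component saturates: doubling `tf` yields less than double the score, and longer-than-average documents are penalized.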
There are no compile flags for this module.