
Information Retrieval

This project aims to explore different information retrieval techniques. We implemented several probabilistic models and compared them with the vector space model and some machine learning models.

Getting Started

Here is a guide to help you run the project on your local machine.

Prerequisites

You need to install the dependencies listed in requirements.txt to run the project. You can use the Makefile to do so:

make install

or

pip install -r requirements.txt

Running the project

To run the project, you can either use the Makefile or run the main file.

Here are the options you can use:

Option                           Description
-h, --help                       Show the help message and exit
-e, --export-inverted-index      Export collection
-i, --import-inverted-index      Import collection
-s, --statistics                 Export statistics
--ltn                            Use the LTN weighting scheme
--ltc                            Use the LTC weighting scheme (length normalization and cosine similarity)
--lnu                            Use the LNU weighting scheme
--bm25                           Use the BM25 weighting scheme (see the sketch after this table)
--bm25fw                         Use the BM25Fw weighting scheme
--bm25fr                         Use the BM25Fr weighting scheme
--cos-sim                        Use cosine similarity for evaluation
-o, --bm25_optimization          Run the BM25 parameter optimization experiment
-g, --granularity GRANULARITY    Granularity of the XPath query
--baseline                       Run the baseline
--export-weighted-idx            Export the weighted index to a JSON file
--query-file QUERY_FILE          File containing the queries
-p, --pre-processed              Use the pre-processed collection to run the experiment
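
As a rough illustration of how the BM25 weighting scheme above scores a document, here is a minimal, self-contained Python sketch. It is not the project's implementation; the function name and the k1/b defaults are assumptions (common textbook values):

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with the classic BM25 formula.

    doc_freqs maps each term to the number of documents containing it;
    k1 and b are the usual free parameters (common defaults shown).
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in doc_freqs:
            continue
        # Smoothed idf (the +1 keeps it non-negative for very frequent terms)
        idf = math.log((num_docs - doc_freqs[term] + 0.5)
                       / (doc_freqs[term] + 0.5) + 1)
        # Term-frequency saturation with document-length normalization
        norm_tf = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score

The -o, --bm25_optimization experiment presumably tunes k1 and b, the two free parameters this sketch exposes.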

Examples

Here are some examples of how to run the project.

make practice5v5 -- --bm25 -g "'.//article'" "'.//title'"

make practice5v5 -- --baseline
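
The -g option in the first example above passes XPath expressions that set the granularity of the retrievable units (whole articles vs. titles here). As a minimal sketch of what such expressions select, assuming an XML collection and the lxml library (the file name is hypothetical):

from lxml import etree

# Hypothetical document from the collection
tree = etree.parse("lib/data/sample.xml")

# Article granularity: whole <article> elements are the retrievable units
articles = tree.xpath(".//article")

# Title granularity: only <title> elements are returned
titles = tree.xpath(".//title")

for title in titles:
    # Concatenated text content of the element
    print(" ".join(title.itertext()))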

Using the -p or --pre-processed option runs the experiment on the pre-processed collection.

make practice5v5 -- --bm25 -g "'.//article'" "'.//title'" -p

Execution is faster with the pre-processed collection, but it must be present in the /lib/processed_data/ folder.

If you don't have the pre-processed collection, you can use the pre_processing notebook to generate it.
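
If you are curious what such pre-processing typically involves, here is a minimal sketch (lowercasing, tokenization, stop-word removal, stemming with NLTK); the exact pipeline and the output file name are assumptions, so refer to the pre_processing notebook for the actual steps:

import json
import re
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Hypothetical usage: pre-process one raw document and cache the result
doc = "Probabilistic models rank documents by estimated relevance."
with open("lib/processed_data/sample.json", "w") as f:
    json.dump(preprocess(doc), f)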