This repository contains the implementation of a search engine for Homework 1 of the Information Retrieval class (CSE272, Spring, UCSC). All the code is written in Python.
Python 3
nltk toolkit (pip install nltk)
trec_eval (cd trec_eval; make)
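Depending on which nltk components the scripts use (e.g. tokenizers or stopword lists), you may also need to download the corresponding NLTK data once. This step is an assumption rather than something the repository documents:

import nltk
nltk.download("punkt")      # tokenizer models, if word_tokenize is used
nltk.download("stopwords")  # stopword lists, if stopword removal is used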
Download the dataset from "google drive" and put all the files in raw_data/. In the raw data, "ohsumed.88-91" is the document collection; "query.ohsu.88-91" is the query file; "qrels.ohsu.1-63" is the relevance judgments file.
Run the command below:
python data_parser.py
This command parses the documents and queries to generate the inverted index and other useful data, which are stored in parsed_data/. It also converts the relevance file into a format trec_eval can read (by simply adding a "0" field to each line).
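For reference, here is a minimal sketch of those two parsing steps. The tokenization and field handling are illustrative assumptions; the actual logic lives in data_parser.py and may differ.

import re
from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict mapping doc_id -> raw text.
    # Returns a mapping term -> {doc_id: term frequency}.
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token][doc_id] += 1
    return index

def make_qrels_trec_friendly(src, dst):
    # trec_eval expects "qid iter docid rel"; the raw qrels lack the
    # iteration column, so insert a constant "0" as the second field.
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            parts = line.split()
            if parts:
                fout.write(parts[0] + " 0 " + " ".join(parts[1:]) + "\n")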
The repository contains one script per ranking method: "boolean.py", "tf.py", "tfidf.py", etc. To get the ranking results for a method, run the corresponding Python file directly. For example, for the tf-idf method, run the command below:
python tfidf.py
The above command will put the ranking results in results/
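For context, a tf-idf run looks roughly like the sketch below. The exact weighting scheme in tfidf.py is not specified here, so the log-scaled tf and idf are illustrative assumptions; the output lines follow the six-column run format trec_eval expects (qid Q0 docid rank score run_tag).

import math
from collections import defaultdict

def rank_tfidf(query_terms, index, num_docs, top_k=50):
    # index: term -> {doc_id: term frequency}, as in the parser sketch above.
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term)
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

def write_run_line(out, qid, doc_id, rank, score, tag="tfidf"):
    # One line per retrieved document, in trec_eval's run format.
    out.write(f"{qid} Q0 {doc_id} {rank} {score:.4f} {tag}\n")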
Suppose the trec_eval folder and this folder live in the same parent directory. To get the evaluation results for tf-idf, run the following from inside the trec_eval folder:
./trec_eval -m all_trec -q ../IR_HW1/raw_data/qrels.ohsu.88-91_trec_friendly ../IR_HW1/results/tfidf_results.out > ../IR_HW1/results/tfidf_evaluation.out
'tfidf_evaluation.out' will then contain results for many different evaluation measures.
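If you want to read specific numbers out of that file programmatically, trec_eval prints whitespace-separated "measure query_id value" lines, with query_id "all" for the aggregate. A small hypothetical helper (read_measure is not part of the repository):

def read_measure(path, measure="map"):
    # Collect one measure per query, e.g. {"1": 0.2345, ..., "all": 0.1876}.
    values = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3 and fields[0] == measure:
                try:
                    values[fields[1]] = float(fields[2])
                except ValueError:
                    pass  # a few measures (e.g. runid) are not numeric
    return values

For example, read_measure("results/tfidf_evaluation.out")["all"] gives the overall MAP of the tf-idf run.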
After running all the algorithms in the repository (boolean, tf, tf-idf, relevance feedback, and our own method), simply run:
python plot_results.py
Note that you may need to modify the paths and file names for the script to run successfully. This command generates two figures in results/; you can compare the performance of the algorithms by viewing these two figures.
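For illustration only, a bar chart like the one sketched below is one way such a comparison could be drawn; plot_results.py may produce something different, and the *_evaluation.out file stems here are assumptions.

import matplotlib.pyplot as plt

def overall_map(path):
    # Pick the summary line "map  all  <value>" out of a trec_eval dump.
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields[:2] == ["map", "all"]:
                return float(fields[2])
    return 0.0

methods = ["boolean", "tf", "tfidf"]  # extend with the other result files
maps = [overall_map(f"results/{m}_evaluation.out") for m in methods]

plt.bar(methods, maps)
plt.ylabel("MAP")
plt.title("Overall MAP by ranking method")
plt.savefig("results/map_comparison.png")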