
BM25 Simple Implementation: COMP3009 Information Retrieval Assignment

Primary language: Python. License: MIT.

This repo archives the assignment for COMP3009 Information Retrieval.

It is a very interesting assignment. It will give you the illusion that you can build a search engine by hand.🥲

(There is a lesson at the end of this course devoted to modern search engines, and it will frustrate you.🫤)

Introduction

The program works on both corpora (small and large).

search_{large|small}_corpus.py builds the index and runs searches on the corresponding corpus. It extracts the documents, removes stopwords, and stems the remaining words with the Porter algorithm. It then builds the index and stores it in cache.json for future use. In both modes (automatic and interactive), the program ranks the documents with BM25 and returns the top 15.

evaluate_{large|small}_corpus.py evaluates the results that search_{large|small}_corpus.py produces in automatic mode. It calculates:
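As a rough illustration of the pipeline described above, here is a minimal BM25 sketch. It is not the code from this repo: the function names, the tiny stopword list, the parameter values k1=1.2 and b=0.75, and the "+1 inside the log" IDF variant are all assumptions, and Porter stemming is left out to keep the example short.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption; the real list is much longer).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords. Stemming is omitted here."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_index(docs):
    """Build per-document term counts and corpus statistics from {doc_id: text}."""
    tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    lengths = {doc_id: sum(counts.values()) for doc_id, counts in tf.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    avgdl = sum(lengths.values()) / len(lengths)
    return {"tf": tf, "lengths": lengths, "df": df, "avgdl": avgdl, "n_docs": len(docs)}


def bm25_rank(query, index, k1=1.2, b=0.75, top_k=15):
    """Return up to top_k (doc_id, score) pairs, highest score first."""
    n = index["n_docs"]
    scores = {}
    for term in tokenize(query):
        df = index["df"].get(term, 0)
        if df == 0:
            continue  # term never appears in the corpus
        # IDF variant with +1 inside the log, so the weight is never negative.
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        for doc_id, counts in index["tf"].items():
            f = counts.get(term, 0)
            if f == 0:
                continue
            # Length normalisation: longer-than-average documents are penalised.
            norm = k1 * (1 - b + b * index["lengths"][doc_id] / index["avgdl"])
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * f * (k1 + 1) / (f + norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

For example, `bm25_rank("cat", build_index(docs))` returns only the documents that contain "cat", with shorter documents ranked higher when the term frequency is equal.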

  • Precision
  • Recall
  • P@10
  • R-precision
  • MAP
  • bpref
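The metrics listed above can be sketched as follows. This is an illustrative version, not the evaluation code from this repo: the function names and argument shapes (a ranked list of document IDs plus sets of judged relevant and judged non-relevant IDs) are assumptions, and bpref follows the trec_eval-style definition, which may differ from the one used in the course.

```python
def precision(ranked, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not ranked:
        return 0.0
    return sum(1 for d in ranked if d in relevant) / len(ranked)


def recall(ranked, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked if d in relevant) / len(relevant)


def p_at_k(ranked, relevant, k=10):
    """Precision over the first k results (P@10 when k=10)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def r_precision(ranked, relevant):
    """Precision over the first R results, where R = number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for d in ranked[:r] if d in relevant) / r


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over an iterable of (ranked, relevant) pairs, one pair per query."""
    values = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0


def bpref(ranked, relevant, nonrelevant):
    """bpref using judged documents only; unjudged documents are ignored."""
    r = len(relevant)
    if r == 0:
        return 0.0
    denom = min(r, len(nonrelevant))
    nonrel_seen, total = 0, 0.0
    for d in ranked:
        if d in nonrelevant:
            nonrel_seen += 1
        elif d in relevant:
            # Penalise by the judged non-relevant documents ranked above this one.
            total += 1.0 if denom == 0 else 1.0 - min(nonrel_seen, r) / denom
    return total / r
```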

Quick Start

File structure

.
├── README.md
├── evaluate_large_corpus.py
├── evaluate_small_corpus.py
├── search_large_corpus.py
└── search_small_corpus.py

0 directories, 5 files

How to start

python3 search_{small|large}_corpus.py -m {automatic|interactive}

python3 evaluate_{large|small}_corpus.py
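For reference, the `-m` flag shown above could be parsed along these lines. This is only a sketch using the standard `argparse` module: the `parse_args` helper, the long `--mode` alias, and the help text are assumptions, not taken from the scripts in this repo.

```python
import argparse


def parse_args(argv=None):
    """Parse the mode flag; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(description="BM25 search (illustrative sketch)")
    parser.add_argument(
        "-m",
        "--mode",  # long alias is an assumption; only -m appears in the README
        choices=["automatic", "interactive"],
        required=True,
        help="automatic: run the predefined queries; interactive: prompt for queries",
    )
    return parser.parse_args(argv)
```

Calling `parse_args(["-m", "interactive"]).mode` yields the string `"interactive"`, and any other value is rejected by `argparse` with a usage error.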