Collection search engine

A simple search engine over the Stanford Information Retrieval (CS276) course collection. Download the collection: http://web.stanford.edu/class/cs276/pa/pa1-data.zip

Authors:

  • Corentin Carteau
  • Alexandre Florez de la Colina
  • Cédric Talbot

This project was realised as part of the CentraleSupélec Information Retrieval course taught by Céline Hudelot.

Setup instructions

Common setup instructions

We provide a quick start setup and a complete setup. The following steps need to be run for both.

  • Check that your Python version is 3.7.6 or above.
  • Install dependencies: pip3 install -r requirements.txt
  • Download the necessary NLTK packages. In a python3 shell, enter the following:
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('tagsets')
>>> nltk.download("wordnet")
>>> nltk.download('averaged_perceptron_tagger')
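
Alternatively, the same packages can be fetched in a single command via NLTK's command-line downloader:

python3 -m nltk.downloader stopwords tagsets wordnet averaged_perceptron_tagger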

Quick start setup

  • Download the collection inverted index: https://drive.google.com/file/d/1KPwEjVXzrvLOf7a2WblP9keBtxtLdj6q/view?usp=sharing
  • Unzip it and place it at the project root.
  • Run the application from the project root: python3 app.py. If the application does not start, check that no other process is running on port 5000. The application takes ~30 seconds to start, the time needed to load the inverted index into memory.
  • Once the application has started, go to http://localhost:5000/ in a browser to make some queries!
  • NOTE: the first query is usually ~10 times slower than subsequent queries.

By default, the engine returns the top 20 matches. This can be changed in the config.py file.

Complete setup (with local inverted index computation)

  • In config.py, change compute_inverted_index to True
  • Run the application from the project root: python3 app.py
  • This will take ~40 minutes to process the collection and build the inverted index.
  • If the application does not start afterwards, check that no other process is running on port 5000.
  • Once the application has started, go to http://localhost:5000/ in a browser to make some queries!
  • In config.py, change compute_inverted_index back to False now that the inverted index is saved locally.
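
For reference, the relevant config.py options might look as follows. compute_inverted_index is the name used above; the name of the option controlling the number of returned matches is an assumption, so check config.py for the actual one.

# config.py (sketch)
compute_inverted_index = False  # True: rebuild the index from the collection (~40 min)
top_k = 20                      # number of matches returned per query (name assumed)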

Engine components

Collection pre-processing

The collection is already tokenized. As additional pre-processing, all words shorter than 2 characters, as well as English stopwords (from the NLTK list), are removed. The words are then POS-tagged with the default NLTK POS-tagger and lemmatized with the NLTK WordNetLemmatizer. This pre-processing is more costly than simple stemming, but it yields higher-quality pre-processed documents.

The pre-processed documents are saved in a separate folder with the same structure as the original collection for easier re-use. This pre-processing step takes around 30 minutes on a standard laptop.
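
As an illustration, here is a minimal sketch of this per-document pipeline using NLTK. The function names and the tag-mapping helper are ours, not the project's actual API:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

# Map Penn Treebank tags (from nltk.pos_tag) to the WordNet POS tags
# expected by the lemmatizer; default to noun, as the lemmatizer does.
def to_wordnet_pos(treebank_tag):
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun (default)

def preprocess(tokens):
    # Drop words shorter than 2 characters and English stopwords.
    kept = [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]
    # POS-tag with the default NLTK tagger, then lemmatize each word.
    return [LEMMATIZER.lemmatize(word, to_wordnet_pos(tag))
            for word, tag in nltk.pos_tag(kept)]

print(preprocess("the students are studying retrieval models".split()))
# e.g. ['student', 'study', 'retrieval', 'model']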

Collection indexing

We use the sklearn TfidfVectorizer to compute the collection vocabulary, the word counts, and the TF-IDF representation of each document. The sparse matrix returned by the vectorizer is then used to build the inverted index for the collection. This inverted index is a dictionary whose keys are the vocabulary terms and whose values are the lists of documents sorted by relevance for the corresponding term. The relevance of a document for a term is the TF-IDF weight of that term in the vectorised document.

The inverted index is saved with pickle so that it can easily be reloaded and re-used. Computing and saving the inverted index takes less than 10 minutes on a standard laptop.
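
Here is a minimal, self-contained sketch of how such an index can be derived from the vectorizer output and persisted with pickle. The identifiers and the file name are illustrative, not the project's actual ones:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stanford retrieval course",
        "retrieval engine inverted index",
        "stanford search engine"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)           # sparse (n_docs, n_terms) matrix
vocabulary = vectorizer.get_feature_names_out()  # term string for each column

# inverted_index: term -> list of (doc_id, tf-idf weight),
# sorted by decreasing weight.
inverted_index = {}
coo = tfidf.tocoo()
for doc_id, term_id, weight in zip(coo.row, coo.col, coo.data):
    inverted_index.setdefault(vocabulary[term_id], []).append((doc_id, weight))
for postings in inverted_index.values():
    postings.sort(key=lambda p: p[1], reverse=True)

# Persist the index with pickle (file name is illustrative).
with open("inverted_index.pkl", "wb") as f:
    pickle.dump(inverted_index, f)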

Document querying

  • The user query is preprocessed:
    • Stop words are removed
    • Words are lemmatized with the same procedure as for documents
    • Duplicate tokens are removed
  • For each word in the query, relevant documents and their TF-IDF weight for the word are retrieved from the inverted index.
  • The documents retrieved for each word are combined and sorted by relevance.

The procedure is the following (a code sketch is given after the list):

  • Compute the intersection of the documents retrieved for each term, together with their weights. The union is used instead if there are not enough documents in the intersection.
  • Score each document with the harmonic mean of its TF-IDF weights for the different query terms. When a document does not contain a query term, the default weight used for that term is half the minimum weight over all retrieved documents.
  • Return the top documents (i.e. those with the largest harmonic mean score).
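
Here is a minimal sketch of this scoring procedure, assuming postings lists of (document, weight) pairs as built above. All identifiers are illustrative, and the "not enough documents" threshold is an assumption:

from statistics import harmonic_mean

def score_documents(query_terms, inverted_index, top_k=20):
    # Postings for each query term found in the index: doc -> weight.
    postings = {t: dict(inverted_index[t])
                for t in query_terms if t in inverted_index}
    if not postings:
        return []
    # Candidate set: intersection of the per-term document sets; fall back
    # to the union when the intersection is too small (threshold assumed).
    doc_sets = [set(p) for p in postings.values()]
    candidates = set.intersection(*doc_sets)
    if len(candidates) < top_k:
        candidates = set.union(*doc_sets)
    # Default weight for a missing term: half the minimum retrieved weight.
    default = min(w for p in postings.values() for w in p.values()) / 2
    scores = {doc: harmonic_mean([p.get(doc, default) for p in postings.values()])
              for doc in candidates}
    # Top documents, i.e. those with the largest harmonic-mean score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]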

Corpus statistics

The original corpus contains 25,510,231 tokens and 354,396 distinct terms. The pre-processed corpus contains 16,184,888 tokens and 291,323 distinct terms.