Information Retrieval System

This is project is implemented for ISNA news agency dataset in four phases:

  • Tokenizer and normalizer functions are implemented for Persian texts and one-word query responding, which is not ranked retrieval.
  • Efficient query responding and ranked retrieval by using champions list and heapsort.
  • K-means and KNN are used to speed up query responding.
  • Boolean Retrieval and Spell Correction system implemented using ElasticSearch.

Table of Contents

Phase1-Implement an inverted index for one-word queries

The most important steps implemented in this phase:

  • Persian Tokenizer
  • Persian Normalization functions:
    • remove_punctuation_marks()
    • remove_postfix()
    • remove_prefix()
    • verb_steaming()
    • char_digit_unification()
    • morakab_unification()
    • remove_arabic_notations()
  • Find and remove stop-words
  • Inverted index by BOW (bag of word) representation for docs
  • One word query responding (It's not a Ranked Retrieval system)

Phase2-Efficient query responding by heapsort and champion list

In this phase, we improve IR system accuracy and speed with famous IR techniques.
The most important steps implemented in this phase:

  • tf-idf Vector representation for docs
  • Similarity calculation (cosine-sim)
  • Index elimination
  • Champion list (tf base)
  • K-top extraction by max-heap
  • Pharase query responding (Ranked Retrieval system)

Phase3-Speed up query responding by K-means clustering and KNN

In this phase, we use a larger dataset to deal with time and memory limitations.

K-means

To have a Ranked Retrieval System like the second phase for query responding ittakes a long time, for decreasing online computing we have to use clustering techniques. we chose k-means and after several experiments, we choose k = 100 and repeat clustering and updating centroid for 5 iterations.

KNN

In this section we implement a categorized search engine with 5 categories

  • "culture"
  • "economy"
  • "sports"
  • "politics"
  • "health"

we use KNN for labeling docs and we check k = 3, 5, and 7 and get the best result by 5 for search in this search engine, you have to enter your query like this.

cat:<cat> <query>
ex: cat:sport قهرمانی پرسپولیس

Phase4-Boolean Retrieval and Spell Correction using ElasticSearch

In this phase, we work by ElasticSearch to deal with larger dataset. This phase can be divided in 2 parts.

  1. Boolean Retrieval
  2. Spell Correction

Boolean Retrieval

At first, an inverted index was created by Bulk API, which is more than 30 times faster than For loop iteration. The query structure is implemented in a way that the user can search for a phrase that includes several words or imply some operations like and (&), or (||), and not (!) to make more accurate queries.

query= {
        "bool": {
          "should": [
              { 
                "match": {
                  "content": {
                    "query": "", # add query word 
                  }
                }
              }, 
              
              { 
                "match_phrase":{
                  "content":{
                    "query":"", # add query word 
                  }
                }  
              },
          ],
          "must_not": [
              {
                "match": {
                  "content": {
                    "query": "", # add query word 
                  }
                }
              }
          ],
        },
    }

For more details, you can Match Query and Match Phrase Query.

Spell Correction

I implemented a spelling correction system with Elasticsearch. There are three steps and in each of which we add a feature and make an improvement. For the implementations, I used Python Elasticsearch Client.
steps:

  • Creating index of trigrams and bigrams
  • Index construction with inverted tokens
  • Synonymous word checking

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Mohammad Javad Ardestani: mjavad.ardestani00@gmail.com
Project Link: https://github.com/MohammadJavadArdestani/Information-Retrieval-System

Acknowledgments