InfoRet-System

An information retrieval system for Boolean queries, proximity queries, and wildcard queries, built on inverted indexing, biword indexing, positional indexing, and Soundex indexing.

Primary language: Python. License: Apache License 2.0.

DESCRIPTION OF EACH FILE:

There are 12 files in this assignment folder:

  • BooleanOperator.py: defines the AND, OR, and NOT operators for the list data structure
  • Conversion.py: defines infix-to-postfix conversion of Boolean expressions
  • ExtendedBinaryRetrieval.py: defines the extended binary retrieval model (phrase queries with a biword index)
  • InverseIndex.py: defines basic inverted indexing
  • Lemmatizer.py: defines tokenization and lemmatization of text
  • main.py: the main program; run this file to test the assignment
  • Query.py: defines query processing (both normal and biword query processing)
  • README.md: this file
  • Stack.py: defines the operations of the stack data structure
  • PositionalIndex.py: defines positional indexing
  • SoundexIndex.py: defines Soundex indexing
  • Soundex.py: defines the Soundex algorithm

  • ExtendedBinaryRetrieval.py extends InverseIndex.py
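
To illustrate how list-based Boolean operators, infix-to-postfix conversion, and a stack fit together when answering a Boolean query, here is a minimal sketch. It is not the code from BooleanOperator.py, Conversion.py, or Stack.py; the function names (and_op, or_op, not_op, to_postfix, evaluate), the lowercase operator tokens, and the toy index are all assumptions made for the example.

```python
# Hypothetical sketch: names and token format are assumptions, not the repo's API.

def and_op(a, b):
    """Intersect two sorted posting lists with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def or_op(a, b):
    """Union of two posting lists, returned sorted."""
    return sorted(set(a) | set(b))

def not_op(a, all_docs):
    """Complement of a posting list with respect to all document ids."""
    excluded = set(a)
    return [d for d in all_docs if d not in excluded]

PRECEDENCE = {"not": 3, "and": 2, "or": 1}

def to_postfix(tokens):
    """Convert an infix Boolean expression (list of tokens) to postfix."""
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the matching "("
        elif tok in PRECEDENCE:
            # "not" is a unary prefix operator, so it never pops anything on arrival.
            while (stack and stack[-1] != "(" and tok != "not"
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # a query term
    while stack:
        output.append(stack.pop())
    return output

def evaluate(postfix, index, all_docs):
    """Evaluate a postfix expression against an inverted index using a stack."""
    stack = []
    for tok in postfix:
        if tok == "not":
            stack.append(not_op(stack.pop(), all_docs))
        elif tok in ("and", "or"):
            right, left = stack.pop(), stack.pop()
            stack.append(and_op(left, right) if tok == "and" else or_op(left, right))
        else:
            stack.append(index.get(tok, []))
    return stack.pop()

# Toy data, invented for the example.
index = {"cat": [1, 2, 4], "dog": [2, 3, 4], "fish": [4, 5]}
all_docs = [1, 2, 3, 4, 5]
postfix = to_postfix("cat and not ( dog or fish )".split())
result = evaluate(postfix, index, all_docs)
print(postfix, result)
```

Converting to postfix first means the evaluator needs no precedence logic of its own: it reads tokens left to right and applies each operator to the lists on top of the stack.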

FOLDERS:

The corpus for this assignment is in the Dataset folder. posting_list.txt and biword_index.txt contain the posting lists for single words and biwords, respectively.
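
As a rough illustration of what single-word and biword posting lists look like, the sketch below builds both from a tiny in-memory corpus and answers a phrase query from the biword index. The corpus, the function names (tokenize, build_indexes, phrase_query), and the whitespace tokenizer are assumptions for the example; the actual program lemmatizes its tokens and may lay out posting_list.txt and biword_index.txt differently.

```python
from collections import defaultdict

def tokenize(text):
    # Simplified stand-in: lowercase and split on whitespace (no lemmatization).
    return text.lower().split()

def build_indexes(docs):
    """Return (inverted index, biword index) as dicts of sorted doc-id lists."""
    inverted = defaultdict(set)
    biword = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for tok in tokens:
            inverted[tok].add(doc_id)
        # Every pair of adjacent tokens becomes one biword entry.
        for first, second in zip(tokens, tokens[1:]):
            biword[f"{first} {second}"].add(doc_id)
    return (
        {term: sorted(ids) for term, ids in inverted.items()},
        {pair: sorted(ids) for pair, ids in biword.items()},
    )

def phrase_query(phrase, biword):
    """Intersect the posting lists of all consecutive biwords in the phrase."""
    tokens = tokenize(phrase)
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    if not pairs:
        return []
    result = set(biword.get(pairs[0], []))
    for pair in pairs[1:]:
        result &= set(biword.get(pair, []))
    return sorted(result)

# Toy corpus, invented for the example.
docs = {
    1: "new york city",
    2: "york new city hall",
    3: "new york city hall",
}
inverted, biword = build_indexes(docs)
print(inverted["new"], biword["new york"], phrase_query("york city hall", biword))
```

For phrases longer than two words, a biword index can return false positives, because it only checks that each adjacent pair occurs somewhere in the document, not that the pairs occur consecutively.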

The Indexes folder contains the indexes generated by the program. The indexes are stored as a dictionary, which is written to a text file.
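
One simple way to keep a dictionary-based index in a text file is to serialize it as JSON. The sketch below assumes that format; the on-disk format the program really uses is not described here, and the helper names (save_index, load_index) and the file name are hypothetical.

```python
import json
import os
import tempfile

def save_index(index, path):
    """Write a term -> posting-list dictionary to a text file as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, sort_keys=True)

def load_index(path):
    """Read a dictionary index back from a JSON text file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Round trip through a temporary directory so the example leaves no files behind.
index = {"cat": [1, 2, 4], "dog": [2, 3, 4]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inverted_index.txt")  # hypothetical file name
    save_index(index, path)
    restored = load_index(path)
print(restored)
```

JSON keeps the file readable in a text editor, at the cost of requiring string keys; document ids used as keys would come back as strings.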

THINGS TO ADD:

  • Implementing indexes through B+-trees
  • Better class structure
  • Processing proximity queries
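
For the proximity-query item, one possible approach is to walk the position lists of a positional index. The sketch below is only an illustration of that idea under stated assumptions: the function names (build_positional_index, proximity_query), the whitespace tokenizer, and the toy corpus are invented here and are not taken from PositionalIndex.py.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} with positions in increasing order."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            index[tok].setdefault(doc_id, []).append(pos)
    return index

def proximity_query(index, term1, term2, k):
    """Return sorted doc ids where term1 and term2 occur within k positions."""
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    hits = []
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        p1, p2 = postings1[doc_id], postings2[doc_id]
        i = j = 0
        # Two-pointer scan: always advance the pointer at the smaller position.
        while i < len(p1) and j < len(p2):
            if abs(p1[i] - p2[j]) <= k:
                hits.append(doc_id)
                break
            if p1[i] < p2[j]:
                i += 1
            else:
                j += 1
    return hits

# Toy corpus, invented for the example.
docs = {
    1: "the quick brown fox jumps",
    2: "quick thinking saved the clever fox",
    3: "fox and quick",
}
pos_index = build_positional_index(docs)
print(proximity_query(pos_index, "quick", "fox", 2))
```

This version treats the distance as unordered, so "fox ... quick" matches as well as "quick ... fox"; an ordered variant would compare p2[j] - p1[i] against k instead of the absolute difference.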