- Design and implement a simple IR system.
- The system should:
  - Create the inverted index (the dictionary and postings lists) for the document collection.
  - Parse and execute simple queries.
  - Perform simple tokenization and normalization of the text, such as removing digits and punctuation marks.
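The three requirements above can be sketched together as follows. This is only an illustrative sketch, not the repo's actual code: the function names (`tokenize`, `build_index`, `search`), the sample documents, the ASCII-only lowercase normalization, and the choice to treat a query as an AND of its terms are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep only runs of a-z, dropping digits and punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each term to its postings: {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Assumed query model: AND of all query terms. Returns sorted doc IDs."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return sorted(result)


# Hypothetical sample collection, keyed by document ID.
docs = {
    1: "The cat sat on the mat.",
    2: "A dog chased the cat in 2019!",
    3: "Dogs and cats: 3 reasons they differ.",
}
index = build_index(docs)
print(search(index, "the cat"))  # [1, 2]
print(index["the"])              # {1: 2, 2: 1}
```

Because there is no stemming in this sketch, "cat" and "cats" are distinct dictionary entries, which is why document 3 does not match the query.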
- Statistics:
  - Report the number of distinct words and the total number of words in each document.
  - Report the number of distinct words and the total number of words in the whole collection.
  - Report the total number of occurrences of each word (term frequency) and the IDs of the documents in which it occurs (that is, output the postings list for a term).
  - Report the 100th, 500th, and 1000th most frequent words and their frequencies of occurrence.
  - Create the postings and store a term frequency for every document in each postings list.
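One possible way to compute the statistics listed above is sketched here. The names `collection_stats` and `kth_most_frequent`, the sample documents, the ASCII-only tokenizer, and the alphabetical tie-breaking rule for equally frequent words are all assumptions for illustration, not part of the original system.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of a-z (assumed normalization)."""
    return re.findall(r"[a-z]+", text.lower())


def collection_stats(docs):
    per_doc = {}        # doc_id -> (distinct words, total words)
    totals = Counter()  # term -> total occurrences in the whole collection
    postings = {}       # term -> {doc_id: term frequency in that document}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        per_doc[doc_id] = (len(counts), sum(counts.values()))
        totals.update(counts)
        for term, tf in counts.items():
            postings.setdefault(term, {})[doc_id] = tf
    return per_doc, totals, postings


def kth_most_frequent(totals, k):
    """Return (term, count) at 1-based rank k, or None if there are fewer than k terms.

    Ties are broken alphabetically (an assumption; the brief does not specify).
    """
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[k - 1] if 1 <= k <= len(ranked) else None


# Hypothetical sample collection, keyed by document ID.
docs = {1: "to be or not to be", 2: "to do is to be", 3: "do be do be do"}
per_doc, totals, postings = collection_stats(docs)

print(per_doc[1])                          # (4, 6)
print(len(totals), sum(totals.values()))   # 6 16
print(postings["do"])                      # {2: 1, 3: 3}
print(kth_most_frequent(totals, 2))        # ('do', 4)
print(kth_most_frequent(totals, 100))      # None
```

On a real collection, `kth_most_frequent(totals, 100)`, `(totals, 500)`, and `(totals, 1000)` would give the three ranks the list asks for, provided the vocabulary has at least that many distinct words.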
- Provide a simple GUI to test the system.

To access the repo, email me at migl8239@kXXX.edu