SimpleSearchEngine (Course project for ITCS414 - Information Storage and Retrieval)

Implementation of a simple indexer that builds an uncompressed index over a corpus, and retrieval for Boolean conjunctive queries.

The program consists of three main classes: BasicIndex.java, Index.java and Query.java. BasicIndex.java performs the file operations, both reading and writing, and is used by both Index.java and Query.java. Index.java does the indexing, that is, it creates the index files from the dataset. Query.java runs queries against the index generated by Index.java.

Indexing is done in Index.java. The process starts by reading the dataset as blocks (one subdirectory per block); the files of each block are then read and tokenized. The tokens are added to PostingLists, which form the index for that block. After the block indexes are created, they are merged pair by pair with the help of mergePosting(), which merges two PostingLists that have the same term ID. Once all merging is done, we get the final corpus.index, which contains all the PostingLists of the dataset.
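The merge of two PostingLists for the same term ID can be sketched as below. This is a minimal illustration, not the project's actual mergePosting(): it assumes a PostingList can be treated as a list of document IDs sorted in ascending order, and the class name PostingMergeSketch and method name mergePostings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: merges two doc-ID lists that belong to the same term ID.
// Assumption: both inputs are sorted in ascending order.
final class PostingMergeSketch {

    static List<Integer> mergePostings(List<Integer> a, List<Integer> b) {
        List<Integer> merged = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        // Walk both lists in step, always taking the smaller document ID.
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x < y) {
                merged.add(x);
                i++;
            } else if (x > y) {
                merged.add(y);
                j++;
            } else {
                // Same document in both blocks: keep a single copy.
                merged.add(x);
                i++;
                j++;
            }
        }
        // Append whatever is left in the list that was not exhausted.
        while (i < a.size()) {
            merged.add(a.get(i++));
        }
        while (j < b.size()) {
            merged.add(b.get(j++));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Integer> blockA = Arrays.asList(1, 4, 7);
        List<Integer> blockB = Arrays.asList(2, 4, 9);
        System.out.println(mergePostings(blockA, blockB)); // prints [1, 2, 4, 7, 9]
    }
}
```

Because both inputs are already sorted, the merge runs in time linear in the combined length and the result stays sorted, so it can be fed into the next pairwise merge.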

Retrieval is done mainly in the provided function retrieve() in Query.java. We added code that builds an ArrayList of the document IDs matching the query. The query is tokenized, and the PostingList for each query term is read from the index produced by the indexing step. The PostingLists are then sorted by frequency so that the smallest one comes first. After that, we go through each document ID in that first PostingList; if the document ID appears in every other PostingList, it is added to the output ArrayList, which is returned at the end. In the provided function outputQueryResult() in Query.java, we added code to print the name of each matching document. If no document matches, it prints "no results found".
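The conjunctive retrieval step can be sketched as follows. This is an illustration under assumptions, not the code in retrieve(): it assumes each PostingList is available as a list of document IDs sorted in ascending order, and it uses a two-pointer intersection instead of checking each candidate ID against every list. The names ConjunctiveQuerySketch, intersectAll and intersect are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a Boolean AND query over posting lists.
// Assumption: every posting list is sorted in ascending doc-ID order.
final class ConjunctiveQuerySketch {

    static List<Integer> intersectAll(List<List<Integer>> postings) {
        if (postings.isEmpty()) {
            return new ArrayList<>();
        }
        // Sort a copy so the shortest list comes first; the running
        // result can never grow beyond the shortest list.
        List<List<Integer>> ordered = new ArrayList<>(postings);
        ordered.sort(Comparator.comparingInt((List<Integer> p) -> p.size()));

        List<Integer> result = new ArrayList<>(ordered.get(0));
        for (int k = 1; k < ordered.size() && !result.isEmpty(); k++) {
            result = intersect(result, ordered.get(k));
        }
        return result;
    }

    // Two-pointer intersection of two sorted doc-ID lists.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                out.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> postings = Arrays.asList(
                Arrays.asList(1, 2, 3, 5, 8),
                Arrays.asList(2, 5),
                Arrays.asList(2, 3, 5, 9));
        List<Integer> hits = intersectAll(postings);
        // An empty result corresponds to the "no results found" case.
        System.out.println(hits.isEmpty() ? "no results found" : hits.toString()); // prints [2, 5]
    }
}
```

Starting from the shortest list keeps the intermediate result small, and the loop stops early as soon as the result becomes empty, which also covers a query term that has no PostingList entries.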
