This is the README file for A0000000X-A0000000X's submission

== Python Version ==

We're using Python Version 2.7.1 for this assignment.

== General Notes about this assignment ==

Indexing Algorithm:
-------------------

Indexing is done in memory first and then written to file at the very
end. The postings are stored in a dictionary with the terms as the keys
and a list of postings as the value for each term. The dictionary of
terms is stored in a separate Python dictionary with the terms as the
keys and a tuple of (byte_offset, document_frequency) as the values.

The documents are indexed as follows:

1) Document IDs in the directory are parsed to integers and sorted.

2) The text in each document is preprocessed:
   i)   Replace '\n' characters in the text with ' '.
   ii)  Apply the NLTK sentence tokenizer.
   iii) Replace all non-alphanumeric characters in each sentence with a
        space. This step normalizes the terms in the sentences. It is
        necessary because the word tokenizer frequently fails to strip
        unnecessary symbols from words, which would otherwise deflate
        the term frequency of a word.
   iv)  Word-tokenize each sentence with the NLTK word tokenizer.
   v)   Case-fold each token.
   vi)  Apply the NLTK Porter stemmer.

3a) For each term, if the term is new, add it to the dictionary and to
    the postings. Then add the occurring document ID (if it has not
    already been added) and its term frequency to the corresponding
    postings list, and update the document frequency in the dictionary.

3b) Keep track of the term frequency of every term in the document with
    a Python dictionary, updating the frequencies as terms are
    encountered and creating new keys for previously unseen terms.

3c) Using the term frequencies, compute the tf weight for every term
    and then the document vector lengths, which are used for computing
    document scores in the Vector Space Model. A vector length is
    calculated as the square root of the sum of the squares of the tf
    weights of a document. Vector lengths are stored in a Python
    dictionary keyed by document ID. (See the first sketch after this
    list.)

4) Add approximately sqrt(document_frequency) evenly spaced skip
   pointers to every postings list, in the form of a list index
   pointing to a document ID to its right, attached to the selected
   document IDs. Postings with skip pointers are tuples of the form
   (document_id, term_frequency, skip_index). This way, an entire
   postings list can be loaded as a Python list when searching, and a
   skip is performed by accessing the element of the list at the
   specified skip index. (See the second sketch after this list.)

5) Write the list of all sorted document IDs to the top of the postings
   file. This is used for NOT queries.

6) While writing the postings list for each term to the postings file,
   record its byte offset from the start of the file in the
   corresponding dictionary entry. This makes locating the postings
   list to be loaded into memory during search more efficient.

7) Pickle the dictionary and write it to the dictionary file. The
   entire dictionary can be unpickled into memory during search, as it
   is relatively small compared to the postings, which are only loaded
   list by list for each search query.

8) Pickle the vector-lengths dictionary and write it to the lengths
   file. This dictionary is unpickled during search when the document
   score computation takes place.
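As an illustration of step 3c, here is a minimal sketch of the
vector-length computation. The names compute_vector_length and
term_frequencies are illustrative, not necessarily the ones used in
indexer.py:

    import math

    def compute_vector_length(term_frequencies):
        # term_frequencies: dict mapping term -> raw count in one document.
        # lnc document weight for a term with frequency tf is 1 + log10(tf).
        weights = [1.0 + math.log10(tf) for tf in term_frequencies.values()]
        # The vector length is the square root of the sum of squared weights.
        return math.sqrt(sum(w * w for w in weights))

    # Toy usage: lengths[doc_id] is looked up at search time to
    # cosine-normalize that document's score.
    lengths = {}
    lengths[42] = compute_vector_length({'fight': 2, 'trade': 1})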
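Step 4 can likewise be sketched as follows, under the assumption that
the postings arrive as sorted (doc_id, tf) tuples; the submitted code
may differ in details:

    import math

    def add_skip_pointers(postings):
        # postings: list of (doc_id, term_frequency) tuples sorted by doc_id.
        n = len(postings)
        skip_distance = int(math.sqrt(n))
        if skip_distance < 2:
            return postings  # too short for skips to help
        result = []
        for i, (doc_id, tf) in enumerate(postings):
            target = i + skip_distance
            if i % skip_distance == 0 and target < n:
                # Attach the list index of the posting this skip jumps to.
                result.append((doc_id, tf, target))
            else:
                result.append((doc_id, tf))
        return result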
Searching Algorithm:
--------------------

Note: There are some changes to how search.py is used. Refer to the
section 'Files included with this submission' for more details.

Before the documents are ranked, all non-alphanumeric characters in the
query string are replaced with a space, and the query is then tokenized
and stemmed with the Porter stemmer. This mirrors how the words in the
documents are indexed.

The documents are then ranked according to the lnc.ltc ranking scheme.
That is, the weight of each term in a document is calculated as:

    1 + log10(term_frequency_in_document)

whereas the weight of each term in the query is calculated as:

    tf-idf = (1 + log10(term_frequency_in_query))
             * log10(number_of_documents / document_frequency)

Cosine normalization is then applied to the term weights of both the
query and the document, and the dot product of the query weights and
the document weights gives a score. In our actual implementation,
however, we omit the cosine normalization of the query: it divides
every score by the same factor, so the ranking of the documents is not
affected. (A sketch of the scoring appears below.)

To retrieve the top 10 ranked documents, we use the nlargest method of
the heapq library. How nlargest selects the top 10 ranked documents:

1) If there are fewer than 10 documents, the method returns all of
   them.
2) Otherwise:
   2a) Build a heap from the first 10 documents.
   2b) For each document A not yet added to the heap:
       a) If A has a higher score than the lowest-scoring document in
          the heap, or the same score but a smaller document ID, that
          document is removed from the heap and A is added.
       b) Otherwise, A is not added to the heap.

Since there are N documents in total and the heap has size 10, there is
an upper bound of N additions and N removals, so the time complexity of
retrieving the top 10 documents is O(N log 10).

After the top 10 (or fewer) documents are retrieved, the document IDs
are sorted in order of decreasing score, breaking ties in favour of
increasing document ID. (A sketch of this selection follows the scoring
sketch below.)
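The scoring described above can be sketched as follows. The helper
load_postings, the dictionary layout, and the variable names are
assumptions for illustration; the actual implementation lives in
search_logic.py:

    import math
    from collections import Counter

    def score_documents(query_terms, dictionary, load_postings, lengths, n_docs):
        # query_terms: stemmed, case-folded query tokens.
        # dictionary: term -> (byte_offset, document_frequency).
        # load_postings(term): returns the postings list for a term.
        # lengths: doc_id -> precomputed document vector length.
        scores = {}
        for term, query_tf in Counter(query_terms).items():
            if term not in dictionary:
                continue
            df = dictionary[term][1]
            # ltc weight of the term in the query (all logs base 10).
            query_weight = ((1.0 + math.log10(query_tf))
                            * math.log10(float(n_docs) / df))
            for posting in load_postings(term):
                doc_id, doc_tf = posting[0], posting[1]  # skip entries carry a 3rd field
                doc_weight = 1.0 + math.log10(doc_tf)    # lnc weight in the document
                scores[doc_id] = scores.get(doc_id, 0.0) + doc_weight * query_weight
        # Normalize by the document length only; dividing by the query
        # length would scale every score equally and not change the ranking.
        return dict((doc_id, s / lengths[doc_id]) for doc_id, s in scores.items())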
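The top-10 selection with the tie-breaking rule described above can be
expressed compactly with a key function. This is a sketch; the
submitted code may structure it differently:

    import heapq

    def top_ten(scores):
        # scores: dict mapping doc_id (int) -> cosine score.
        # nlargest keeps a heap of size 10, so selection is O(N log 10).
        # The key (score, -doc_id) means that on equal scores the smaller
        # document ID ranks higher, and the returned list is already in
        # the required order: decreasing score, then increasing doc_id.
        return heapq.nlargest(10, scores, key=lambda doc_id: (scores[doc_id], -doc_id))

    # Toy usage: documents 1 and 2 tie, so the smaller ID comes first.
    top_ten({1: 0.5, 2: 0.5, 3: 0.9})  # -> [3, 1, 2]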
== Essay Questions ==

1. In this assignment, we didn't ask you to support phrasal queries,
which is a feature that is typically supported in web search engines.
Describe how you would support phrasal search in conjunction with the
VSM model. A sketch of the algorithm is sufficient. (For those of you
who like a challenge, please go ahead and implement this feature in
your submission but clearly demarcate it in your code and allow this
feature to be turned on or off using the command line switch "-x"
(where "-x" means to turn on the extended processing of phrasal
queries). We will give a small bonus to submissions that achieve this
functionality correctly.)

In our opinion, one possible way to support phrasal queries is to index
the documents into postings for unigrams and bigrams, and to store both
types of grams in the index. This can be extended to higher-order
n-grams as long as space is not an issue. For example, "Fight me
please" would be indexed as "Fight", "me", "please", "Fight me" and
"me please". To perform a search, the n-grams of the query are obtained
first. With phrasal queries enabled, we could then modify the VSM so
that the axes of each document's vector include the n-grams along with
the unigrams, apply the same lnc.ltc ranking scheme, and rank each
document accordingly. However, this implementation cannot handle
phrases longer than n, where n is the highest-order n-gram stored in
the index.

2. Describe how your search engine reacts to long documents and long
queries as compared to short documents and queries. Is the
normalization you use sufficient to address the problems (see Section
6.4.4 for a hint)? In your judgement, is the ltc.lnc scheme (n.b., not
the ranking scheme you were asked to implement) sufficient for
retrieving documents from the Reuters-21578 collection?

Firstly, long queries take longer to process. Long queries that share
many words with a particular document give very accurate results.
However, if the query is short, shorter documents may be favoured even
though normalization is done. This is because short queries often leave
the resulting vector with limited axes of freedom. Short documents with
little diversity, but with words matching the query, tend to have a
larger cosine similarity, since their vectors have few axes of freedom,
much like short queries. Long documents with a similar number of
matching words, on the other hand, have more axes of freedom. The
presence of these extra axes can "pull" a long document's vector away
from the query vector, creating a larger angle between them.
Normalization does not solve this, as it does not affect the angle
between the vectors.

== Files included with this submission ==

README.txt - This file
index.py - Required file for submission
indexer.py - Performs the indexing of a directory of documents
dictionary.txt - Pickled dictionary of terms from the Reuters training
                 dataset
postings.txt - Postings list of each term specified in dictionary.txt
postings_plaintext.txt - Human-readable index with each dictionary word
                         and its corresponding postings list
lengths.txt - Stores a dictionary of document vector lengths
search.py - Required file for submission.
            Note: the usage of search.py now differs slightly from the
            original due to the addition of the lengths.txt file. The
            correct usage of search.py is now:

            python search.py -d dictionary-file -p postings-file -l length-file -q file-of-queries -o output-file-of-results

search_logic.py - Main implementation of the search logic
query_parser.py - Contains a simple query parser to split and stem a
                  query string
search_utils.py - Commonly used utility methods for searching

== Statement of individual work ==

Please initial one of the following statements.

[x] I, A0000000X-A0000000X, certify that I have followed the CS 3245
Information Retrieval class guidelines for homework assignments. In
particular, I expressly vow that I have followed the Facebook rule in
discussing with others in doing the assignment and did not take notes
(digital or printed) from the discussions.

[ ] I, A0000000X, did not follow the class rules regarding homework
assignment, because of the following reason:

<Please fill in>

I suggest that I should be graded as follows:

<Please fill in>

== References ==

Forums on IVLE - comparing search results
Python's heapq documentation and source code - helped us understand the
nlargest method
CS3245 Lecture 7 notes - indexing and search algorithms for the VSM