== Python Version ==

We're using Python Version 3.8.10 for this assignment.

== General Notes about this assignment ==

# Index

The high-level idea of the indexing program is to:

1. Create two dictionaries. The first ('posting') maps each term to a dictionary of documentIds and the frequency with which that term occurs in each document. The second ('weightage') maps each documentId to a) a dictionary of terms and their weights in that document, and b) the importance of the court for that documentId.
2. Store these dictionaries in a dictionary-file, and the list of postings in a postings-file.

We make the following assumptions for indexing:

- The weightage dictionary can fit in memory. This is because we use this information for wtd (weight-term-document) during the search phase. If this were disallowed, we could have stored this data structure the same way the dictionary is linked to the postings, via file size and offset.
- The dataset only contains English and Chinese documents.
- The documents and queries to be indexed and searched are in English only; documents that are not in English are ignored during indexing.

The first step is to build a dictionary for dataset.csv by iterating through each term in each sentence in each entry (row) of the target dataset. We index the "TITLE" and "CONTENT" zones of each document. For each zone, we label the current zone being processed with the string ".t" or ".c" for TITLE and CONTENT respectively rather than the full zone name, which reduces the number of bytes written. Note also that in this iterative step we skip Chinese documents, detected by checking whether the TITLE contains Chinese unicode characters from the CJK Unified Ideographs range.

For indexing, these are the preprocessing techniques we have used:

- NLTK's sentence tokenization
- NLTK's word tokenization
- Case-folding
- WordNetLemmatizer's lemmatization

During preprocessing, we moved away from PorterStemmer towards WordNetLemmatizer because the latter preserves the semantics of the original text. The former simply performs a crude "chopping" of word suffixes and may not preserve the original meaning of the word. Using lemmas therefore allows us to exploit more language-specific and contextual information, such as part of speech. Since the dataset consists of court cases, the documents are primarily text converted from spoken English rather than generic passages. We take advantage of this format by lemmatizing terms and reducing them to verbs, as verbs carry the most semantic features in English. It is often argued that lemmatization takes more time than stemming; however, in our experiments the time taken to index and search was about the same, so it should not heavily impact the performance of indexing or searching.

After preprocessing, we insert each term into the term dictionary and increment the current document's term frequency counter. Each documentId is also inserted into the weightage dictionary, where its term frequency counter is incremented. A word position counter is also incremented for every term found in a document, so that each occurrence of a term is inserted into a positional posting list (zero-indexed).

The program then converts the term frequencies in the weightage dictionary into weights of the terms in each document. This is done by iterating through each documentId and its terms, computing their weights (i.e., log term frequency) and normalizing the resulting weights by the sqrt(sum of weights^2).
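To illustrate, the sketch below shows one possible shape of this pipeline. It is not our exact code: the names index_zone, normalize_document_weights, postings and weightage are hypothetical, appending the zone label to the term is an assumed encoding, and 1 + log10(tf) is assumed for the log term frequency weight.

import math
from collections import defaultdict

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# term -> {documentId -> [positions]}        (the 'posting' dictionary)
postings = defaultdict(lambda: defaultdict(list))
# documentId -> {term -> tf, later weight}   (the 'weightage' dictionary)
weightage = defaultdict(lambda: defaultdict(int))

def index_zone(doc_id, zone_label, text, position=0):
    """Tokenize, case-fold and lemmatize one zone of a document,
    updating the posting and weightage dictionaries."""
    for sentence in nltk.sent_tokenize(text):
        for token in nltk.word_tokenize(sentence):
            # Zone label (".t" or ".c") appended to the term; this encoding is an assumption.
            term = lemmatizer.lemmatize(token.lower(), pos="v") + zone_label
            postings[term][doc_id].append(position)  # zero-indexed positional posting
            weightage[doc_id][term] += 1             # raw term frequency for now
            position += 1
    return position

def normalize_document_weights():
    """Convert raw term frequencies into cosine-normalized log-tf weights."""
    for doc_id, term_freqs in weightage.items():
        weights = {t: 1 + math.log10(tf) for t, tf in term_freqs.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        weightage[doc_id] = {t: w / norm for t, w in weights.items()}

A caller would process each document's TITLE and CONTENT zones with a shared position counter and, once all documents have been indexed, call normalize_document_weights() once.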
While iterating through each document, we also categorize it according to the "Notes about Court Hierarchy": documents from the Most Important courts are labeled H, Important courts M, and the rest L, representing the importance / relevance of that document. Since both the weightage and the importance are associated directly with a document, the pair is stored in the same dictionary object.

Lastly, the posting dictionary is iterated to extract each final posting list, which is written to the postings-file, with its written size and position stored as a 'pointer' in a pointer dictionary. This pointer dictionary, together with the weightage-importance dictionary, is then written to the dictionary-file.

# Search

We will now discuss the program for search. The high-level idea of the searching program is to:

1. Load data from the dictionary-file while keeping a file pointer to the postings-file.
2. Process the queries from the query-file.
3. Write the results of the queries to the output-file.

We make the following assumptions for searching:

- The full postings-file cannot be loaded into memory.
- The dictionary-file can be loaded into memory (if keeping both dictionaries in memory were not allowed, we could have written the weightage dictionary to disk, but since speed is a concern we make this assumption so as not to degrade performance).
- The quotation marks used for phrasal queries are " ", not “ ”.
- "AND" is reserved strictly for boolean queries.

The first step of searching is to load the dictionary from the dictionary-file into memory and keep a file pointer for accessing the relevant postings within the postings-file. Note that the postings are not loaded into memory the way the dictionary is, due to the assumptions stated above.

Next, we load the query-file and parse each query into a QueryDetails object. In QueryDetails, these are the preprocessing techniques we have used:

- NLTK's word tokenization
- Case-folding
- WordNetLemmatizer's lemmatization

Subsequently, it checks the type of the query, which is either boolean, phrasal, or free-text. It detects boolean queries via the "AND" keyword, and phrasal queries via their quotation marks. We also perform some basic error checking, including checking for invalid placement of "AND"s. If the query is invalid, the program immediately writes an empty line to the output file. If the query type is not free-text, we convert it to free-text, as the rigid boolean query algorithm did not return desirable results in our experiments. Even though a positional index was constructed in the indexing step, it is consequently not used for phrasal queries, since all query types are converted to free-text.

For query refinement, we implemented query expansion. Our implementation is a hybrid of NLTK WordNet's synsets and a manually curated synonym set. For every term in the query, we first check whether we recognise the word (i.e., whether it is part of our custom synset). If we recognise it, we expand it using the associated custom set of words. If not, we rely on NLTK WordNet's synonym sets: we use a similarity score to rank the synonym sets, then take the first k synonyms (not synonym sets) to expand the term, so that the query does not drift. The custom sets of words were curated by reading through the documents, attempting to understand the context, and grouping together words that frequently occur in the same context (cf. the distributional hypothesis). We feel this is more informative for legal case retrieval than relying on a generic thesaurus; since our goal is to build a system for the specific purpose of legal case retrieval, it makes sense to attempt optimisation by studying the nature of the documents. We also explored pseudo-relevance feedback, but it is no longer part of the submission due to its lackluster performance; more details can be found in BONUS.docx.
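For illustration, here is a simplified sketch of this hybrid expansion under stated assumptions: CUSTOM_SYNSETS and expand_term are hypothetical names (the actual curated sets live in query.py), the example synonyms are made up, and WordNet path similarity stands in for whatever similarity score is used to rank the synsets.

from nltk.corpus import wordnet as wn

# Hypothetical hand-curated synonym sets; the real curated sets are derived
# from reading through the legal documents themselves.
CUSTOM_SYNSETS = {
    "damages": {"compensation", "restitution", "payout"},
    "theft": {"stealing", "larceny"},
}

def expand_term(term, k=3):
    """Return the term together with up to k expansion words,
    preferring the custom sets and falling back to WordNet."""
    if term in CUSTOM_SYNSETS:
        return {term} | CUSTOM_SYNSETS[term]

    synsets = wn.synsets(term)
    if not synsets:
        return {term}

    # Rank synsets by similarity to the term's most common sense (an assumed
    # choice of similarity score), then take the first k synonyms, not synsets.
    base = synsets[0]
    ranked = sorted(synsets, key=lambda s: base.path_similarity(s) or 0.0, reverse=True)
    expansion = {term}
    for synset in ranked:
        for lemma in synset.lemma_names():
            expansion.add(lemma.replace("_", " ").lower())
            if len(expansion) > k:  # the term itself plus k synonyms
                return expansion
    return expansion

Capping the expansion at k synonyms keeps the expanded query from drifting too far from the original intent.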
Following the parsing, the terms in each query are used to compute cosine scores. Each query term's weight is computed as the log term frequency multiplied by the inverse document frequency (tf-idf), and the resulting weights are normalized by the sqrt(sum of weights^2). For each query term weight, we then retrieve the document weights from all postings in which the term appears, and sum the products of the query weight and the document weight. In this step we also use the document importance (H, M, L) to give more weight to more important documents: we retrieve each document's importance label and apply a multiplier of 10, 8.5, and 1 respectively. This boosts the scores of documents from more important courts while lowering those of less important ones, giving us the final cosine score. We have omitted the normalization step here, as interestingly our system seems to perform better without it.

The results are then sorted in descending order of their final cosine score, with the per-zone scores (i.e., "TITLE" and "CONTENT") combined by giving a higher weight to "CONTENT" (80%) than to "TITLE" (20%) based on our testing. Finally, we write our rankings to the output file.

== Files included with this submission ==

- algorithm.py : Code implementation for the TopK class
- index.py : Code implementation for indexing a directory to the dictionary and postings files
- models.py : Code implementation for the VectorSpaceModel class
- query.py : Code implementation for the abstraction of the query format and query expansion
- search.py : Code implementation for executing search
- weighting.py : Code implementation for the TermFrequency, DocumentFrequency, and TfIdfWeight classes
- dictionary.txt : Encoded data file containing the dictionary that maps to the metadata of the postings
- postings.txt : Encoded data file containing all posting lists of the legal dataset
- bonus/ : Folder containing code for other query refinements we tried but did not include in the submission
- BONUS.docx : Explanation of the bonus component
- README.txt : This file

== Statement of individual work ==

Please put a "x" (without the double quotes) into the bracket of the appropriate statement.

[x] We, <redacted>, certify that we have followed the <redacted> class guidelines for homework assignments. In particular, we expressly vow that we have followed the Facebook rule in discussing with others in doing the assignment and did not take notes (digital or printed) from the discussions.

== References ==

- Chinese Unicode: https://unicode-table.com/en/blocks/