Homework 2, Descriptive Modeling

Parent class: Preprocessing.java

Child classes: NGram.java, TFIDF.java

Independent class: DescriptiveModeling.java

Class Descriptions

Preprocessing.java

This is the parent class because it contains preprocessing methods that can be inherited by NGram and TFIDF.

Functions: convert file to ArrayList, punctuation removal, extract unique terms.
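
The three functions listed above could look roughly like the sketch below. This is not the actual Preprocessing.java: the class name PreprocessingSketch and the method names fileToList, toTokens, removePunctuation and uniqueTerms are hypothetical, tokens are assumed to be whitespace-separated, and apostrophes are assumed to be left in place.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and behavior are assumptions, not the repo's code.
class PreprocessingSketch {

    // Read a whole text file (Java 11+) and split it into tokens.
    static List<String> fileToList(Path path) throws IOException {
        return toTokens(Files.readString(path));
    }

    // Split text on runs of whitespace, skipping empty pieces.
    static List<String> toTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Remove ASCII punctuation from each token, except apostrophes (assumption).
    static List<String> removePunctuation(List<String> tokens) {
        List<String> cleaned = new ArrayList<>();
        for (String t : tokens) {
            String c = t.replaceAll("[\\p{Punct}&&[^']]", "");
            if (!c.isEmpty()) {
                cleaned.add(c);
            }
        }
        return cleaned;
    }

    // Unique terms in first-seen order; matching is case-sensitive.
    static Set<String> uniqueTerms(List<String> tokens) {
        return new LinkedHashSet<>(tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = removePunctuation(toTokens("The mail, the mail! It's here."));
        System.out.println(tokens);              // [The, mail, the, mail, It's, here]
        System.out.println(uniqueTerms(tokens)); // [The, mail, the, It's, here]
    }
}
```

A LinkedHashSet is used here so the unique terms keep the order in which they first appear, which keeps printed matrices stable between runs.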

DescriptiveModeling.java

Implements tokenization, lemmatization, named entity tagging, and stopword removal using the Stanford NLP library.

NGram.java

An NGram object is a matrix that shows which words appear together most often. Its data structure is a two-dimensional ArrayList whose columns are HashMaps, with the token as the key and an int array holding the frequency with which the two words co-occur.

Output interpretation: To read the matrix, first pick a row, for example "the". Then pick a column, e.g. "mail". If the corresponding value is 2, then "the mail" occurs twice in our document.

Functions: calculates and prints the NGram matrix, retrieves the NGrams that meet a desired frequency threshold
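
To make the row/column reading concrete, here is a minimal bigram-count sketch. It is a simplification rather than the real NGram.java: it uses nested maps instead of the ArrayList-of-HashMap layout described above, the names BigramSketch, countBigrams and atOrAbove are hypothetical, and the threshold is assumed to be inclusive (count >= threshold).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; a nested-map stand-in for the NGram matrix.
class BigramSketch {

    // Outer key = first word (row), inner key = second word (column),
    // value = how often the two words appear next to each other.
    static Map<String, Map<String, Integer>> countBigrams(List<String> tokens) {
        Map<String, Map<String, Integer>> matrix = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            matrix.computeIfAbsent(tokens.get(i), k -> new LinkedHashMap<>())
                  .merge(tokens.get(i + 1), 1, Integer::sum);
        }
        return matrix;
    }

    // Collect every bigram whose count is at least the threshold (assumed inclusive).
    static List<String> atOrAbove(Map<String, Map<String, Integer>> matrix, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> row : matrix.entrySet()) {
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                if (cell.getValue() >= threshold) {
                    result.add(row.getKey() + " " + cell.getKey());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "mail", "is", "in", "the", "mail");
        Map<String, Map<String, Integer>> matrix = countBigrams(tokens);
        System.out.println(matrix.get("the").get("mail")); // 2
        System.out.println(atOrAbove(matrix, 2));          // [the mail]
    }
}
```

In this toy input, row "the" and column "mail" hold the value 2, matching the reading described in the output interpretation.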

TFIDF.java

TFIDF employs a data structure similar to that of an NGram object. The values are the TF-IDF values of a given term (row) in a given document (column) against the entire corpus of documents (all columns).

Functions: calculates and prints TFIDF matrix
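
A minimal sketch of a term-by-document TF-IDF matrix follows. The exact weighting used in TFIDF.java is not stated here, so this assumes one common variant: tf = (count of term in document) / (document length) and idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. The class name TfidfSketch and the method name tfidf are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; the weighting formula is an assumption, not the repo's code.
class TfidfSketch {

    // Returns term -> one TF-IDF score per document (row = term, column = document).
    static Map<String, double[]> tfidf(List<List<String>> docs) {
        // Vocabulary across the whole corpus, sorted for stable output.
        TreeSet<String> vocab = new TreeSet<>();
        for (List<String> doc : docs) {
            vocab.addAll(doc);
        }

        int n = docs.size();
        Map<String, double[]> matrix = new LinkedHashMap<>();
        for (String term : vocab) {
            // Document frequency: number of documents containing the term (always >= 1).
            int df = 0;
            for (List<String> doc : docs) {
                if (doc.contains(term)) {
                    df++;
                }
            }
            double idf = Math.log((double) n / df);

            double[] row = new double[n];
            for (int j = 0; j < n; j++) {
                List<String> doc = docs.get(j);
                double tf = doc.isEmpty()
                        ? 0.0
                        : (double) Collections.frequency(doc, term) / doc.size();
                row[j] = tf * idf;
            }
            matrix.put(term, row);
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "mail"),
                Arrays.asList("the", "cat"));
        for (Map.Entry<String, double[]> row : tfidf(docs).entrySet()) {
            System.out.println(row.getKey() + " " + Arrays.toString(row.getValue()));
        }
    }
}
```

Under this variant a term that appears in every document (here "the") scores 0 everywhere, because ln(N / N) = 0, while a term unique to one document gets a positive score only in that document's column.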

To-do Log

DescriptiveModeling/Preprocessing: move lemmatization to be in Preprocessing instead of DescriptiveModeling.
Preprocessing: improve punctuation removal. Currently ignoring apostrophes.
NGram: instead of just bigrams, expand to trigrams up to n-grams in function addFrequencies.
NGram: instead of indicating a threshold, indicate 'top 3' (or similar) in function getConcurrent.
Word to Vec: Task has yet to be completed.

Nice-to-haves

Implement Viterbi probabilities based on part-of-speech tagging (covered in my NLP class) to give better predictions.

loulai/pamac