Louise Y. Lai
N12709809 ll2663
Predictive Analytics
Submission 1, Oct 22 2017
Parent class: Preprocessing.java
Child classes: NGram.java, TFIDF.java
Independent class: DescriptiveModeling.java
Preprocessing is the parent class because it contains preprocessing methods that can be inherited by NGram and TFIDF.
Functions: convert file to ArrayList, punctuation removal, extract unique terms.
Implements tokenization, lemmatization, named entity tagging, and stopword removal using the Stanford NLP library.
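As an illustration of the punctuation removal and unique-term extraction listed above, here is a minimal sketch in plain Java. The class name PreprocessingSketch and its method names are hypothetical and are not taken from Preprocessing.java; the regular expression is one possible choice, and the Stanford NLP steps are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names are illustrative, not the submission's code.
class PreprocessingSketch {

    // Replace every character that is not a letter, digit, or whitespace
    // with a space, then collapse runs of whitespace.
    // Note: this also splits words at apostrophes ("don't" becomes "don t").
    static String removePunctuation(String line) {
        return line.replaceAll("[^\\p{L}\\p{N}\\s]", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Clean each line, split it on spaces, and lower-case every token.
    static List<String> tokenize(List<String> lines) {
        List<String> tokens = new ArrayList<>();
        for (String line : lines) {
            String cleaned = removePunctuation(line);
            if (cleaned.isEmpty()) {
                continue; // skip lines that contained only punctuation
            }
            for (String token : cleaned.split(" ")) {
                tokens.add(token.toLowerCase());
            }
        }
        return tokens;
    }

    // Unique terms, kept in the order they were first seen.
    static Set<String> uniqueTerms(List<String> tokens) {
        return new LinkedHashSet<>(tokens);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The mail arrived.", "The mail, again!");
        List<String> tokens = tokenize(lines);
        System.out.println(tokens);              // [the, mail, arrived, the, mail, again]
        System.out.println(uniqueTerms(tokens)); // [the, mail, arrived, again]
    }
}
```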
An NGram object is a matrix that shows which words appear together most often. The data structure of an NGram object is a two-dimensional ArrayList whose columns are a HashMap, with the token as the key and an int array holding how often the two words co-occur.
Output interpretation: To read the matrix, first pick a row, for example "the". Then pick a column, e.g. "mail". If the corresponding value is 2, then "the mail" occurs twice in our document.
Functions: calculates and prints the NGram matrix, retrieves NGrams for a given frequency threshold
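The row/column reading described above can be sketched as follows. This is an assumption-laden illustration, not the code in NGram.java: it swaps the ArrayList-of-HashMap layout for nested maps (first word, then second word, then count), and the names BigramSketch, countBigrams, and atLeast are hypothetical. The threshold is assumed to mean "count at or above the given value".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; nested maps stand in for the matrix described above.
class BigramSketch {

    // counts.get(first).get(second) is the number of times "first second" occurs.
    static Map<String, Map<String, Integer>> countBigrams(List<String> tokens) {
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.computeIfAbsent(tokens.get(i), k -> new LinkedHashMap<>())
                  .merge(tokens.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    // Return every bigram whose count is at or above the threshold (assumed meaning).
    static List<String> atLeast(Map<String, Map<String, Integer>> counts, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> row : counts.entrySet()) {
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                if (cell.getValue() >= threshold) {
                    result.add(row.getKey() + " " + cell.getKey());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "mail", "arrived", "the", "mail", "again");
        Map<String, Map<String, Integer>> counts = countBigrams(tokens);
        System.out.println(counts.get("the").get("mail")); // 2
        System.out.println(atLeast(counts, 2));            // [the mail]
    }
}
```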
TFIDF employs a similar data structure to an NGram object. The values are the TF-IDF values of a given term (row) in a given document (column) against the entire corpus of documents (all columns).
Functions: calculates and prints TFIDF matrix
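A minimal sketch of how such a term-by-document matrix could be filled in is shown below. The weighting is an assumption, since the exact variant used in TFIDF.java is not stated here: tf is the term count divided by the document length, and idf is the natural log of (number of documents / number of documents containing the term). The names TfIdfSketch and tfidf are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the weighting scheme is an assumed, common variant.
class TfIdfSketch {

    // Returns term -> one TF-IDF value per document (row = term, column = document).
    static Map<String, double[]> tfidf(List<List<String>> docs) {
        int n = docs.size();

        // Document frequency: in how many documents does each term appear?
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // One zero-filled row per term.
        Map<String, double[]> matrix = new TreeMap<>();
        for (String term : df.keySet()) {
            matrix.put(term, new double[n]);
        }

        for (int d = 0; d < n; d++) {
            List<String> doc = docs.get(d);
            if (doc.isEmpty()) {
                continue; // avoid dividing by a zero document length
            }
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) n / df.get(e.getKey()));
                matrix.get(e.getKey())[d] = tf * idf;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "mail"),
                Arrays.asList("the", "cat"));
        for (Map.Entry<String, double[]> row : tfidf(docs).entrySet()) {
            System.out.println(row.getKey() + " " + Arrays.toString(row.getValue()));
        }
    }
}
```

Under this weighting, a term that appears in every document (here "the") gets a value of 0 in every column, because its idf is log(1).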
- DescriptiveModeling/Preprocessing: move lemmatization from DescriptiveModeling into Preprocessing.
- Preprocessing: improve punctuation removal. Currently ignoring apostrophes.
- NGram: instead of just bigrams, expand to trigrams through n-grams in function addFrequencies.
- NGram: instead of indicating a threshold, indicate 'top 3' (or similar) in function getConcurrent.
- Word2Vec: task has yet to be completed.
- Implement Viterbi probabilities based on part-of-speech tagging, learnt in my NLP class, to give better predictions.
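For the two NGram items in the list above, one possible direction is sketched here. It is only an illustration under stated assumptions: the names NGramTopKSketch, countNGrams, and topK are hypothetical, n-grams are stored as space-joined strings rather than in the matrix layout, and ties in the 'top k' ranking are broken by first appearance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of n-gram counting and a "top k" lookup.
class NGramTopKSketch {

    // Count every run of n consecutive tokens, keyed by the space-joined run.
    static Map<String, Integer> countNGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    // Return the k most frequent n-grams. The sort is stable, so n-grams with
    // equal counts stay in first-seen order.
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "mail", "arrived", "the", "mail", "again");
        System.out.println(topK(countNGrams(tokens, 2), 2)); // [the mail, mail arrived]
        System.out.println(countNGrams(tokens, 3).keySet());
    }
}
```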