Program to compute the bigram model (counts and probabilities) on the given corpus (HW2_F17_NLP6320-NLPCorpusTreebank2Parts-CorpusA.txt) under the following three scenarios:
- No Smoothing
- Add-one Smoothing
- Good-Turing Discounting based Smoothing
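For reference, the three estimators (with $C(\cdot)$ a corpus count, $V$ the vocabulary size, $N$ the total number of bigram tokens, and $N_c$ the number of distinct bigrams occurring exactly $c$ times):

$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}$$

$$P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i) + 1}{C(w_{i-1}) + V}$$

$$c^{*} = (c+1)\,\frac{N_{c+1}}{N_c}, \qquad P_{\text{GT}} = \frac{c^{*}}{N}, \qquad P_{\text{GT}}(\text{unseen}) = \frac{N_1}{N}$$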
- Note:
- The “ . ” string sequence in the corpus is used to break it into sentences.
- Each sentence is tokenized into words, and bigrams are computed ONLY within a sentence.
- Whitespace (i.e., space, tab, and newline) is used to tokenize a sentence into the words/tokens required for the bigram model.
- No word/token normalization (e.g., stemming, lemmatization, lowercasing) is performed.
- Bigram creation and matching are exact and case-sensitive (a sketch implementing these rules follows this list).
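A minimal sketch of how these rules might be implemented (the function names are illustrative, and the Good-Turing estimate here uses the unconditioned $c^{*}/N$ form):

```python
from collections import Counter

def train_bigram_model(path="HW2_F17_NLP6320-NLPCorpusTreebank2Parts-CorpusA.txt"):
    """Count unigrams and bigrams: sentences are split on the " . "
    sequence, tokens on whitespace, with no normalization (case-sensitive)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    unigrams, bigrams = Counter(), Counter()
    for sentence in text.split(" . "):           # " . " marks sentence boundaries
        tokens = sentence.split()                # whitespace-only tokenization
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # bigrams never cross a sentence
    return unigrams, bigrams

def p_mle(w1, w2, unigrams, bigrams):
    """No smoothing: C(w1 w2) / C(w1); zero for unseen bigrams."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def p_add_one(w1, w2, unigrams, bigrams):
    """Add-one smoothing: (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(unigrams))

def p_good_turing(w1, w2, bigrams):
    """Good-Turing: reestimate count c as c* = (c+1) * N_{c+1} / N_c;
    unseen bigrams share the probability mass N_1 / N."""
    N = sum(bigrams.values())
    n = Counter(bigrams.values())                # N_c: number of bigrams seen c times
    c = bigrams[(w1, w2)]
    if c == 0:
        return n[1] / N
    if n[c + 1] == 0:                            # empty higher bucket: keep raw count
        return c / N
    return (c + 1) * n[c + 1] / n[c] / N
```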
- Input Sentence:
The Fed chairman warned that the board 's decision is bad
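Using the sketch above, the probability of this sentence under a given estimator can be chained over its bigrams, e.g. with add-one smoothing:

```python
unigrams, bigrams = train_bigram_model()
tokens = "The Fed chairman warned that the board 's decision is bad".split()
prob = 1.0
for w1, w2 in zip(tokens, tokens[1:]):
    prob *= p_add_one(w1, w2, unigrams, bigrams)
print(prob)
```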
Transformation-based POS Tagging:
Implemented Brill’s transformation-based POS tagging algorithm, using ONLY the previous word’s tag as context, to extract the best-scoring rule from the following candidate transformations (see the scoring sketch after the list):
- Transform “NN” to “JJ”
- Transform “NN” to “VB”
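A minimal sketch of the rule-scoring step, assuming parallel lists of (word, tag) sentences for the current tagging and the gold standard (names and data layout are illustrative); a rule's score is the number of errors it fixes minus the number of correct tags it breaks:

```python
from collections import Counter

def best_transformation(current, gold, candidates=(("NN", "JJ"), ("NN", "VB"))):
    """Brill-style rule selection restricted to the template
    'change tag FROM to TO when the previous word's tag is Z'."""
    scores = Counter()
    for cur_sent, gold_sent in zip(current, gold):
        for i in range(1, len(cur_sent)):
            prev_tag = cur_sent[i - 1][1]
            cur_tag, gold_tag = cur_sent[i][1], gold_sent[i][1]
            for frm, to in candidates:
                if cur_tag != frm:
                    continue
                if gold_tag == to:               # applying the rule fixes an error
                    scores[(frm, to, prev_tag)] += 1
                elif gold_tag == frm:            # applying the rule breaks a correct tag
                    scores[(frm, to, prev_tag)] -= 1
    return scores.most_common(1)[0]              # ((from, to, prev_tag), score)
```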