NLP ALGORITHMS

My NLP algorithm implementations and notes, collected since the start of my NLP journey.

The chosen and implemented algorithms are described below. All implementations also include comments and explanations in the source code.

Maximum Matching Algorithm finds the words of a sentence when the sentence has no 
separators (spaces) between words. It works well for languages such as Chinese and 
Japanese, but it is not accurate for languages with long words, such as English.
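
As an illustration, here is a minimal sketch of forward maximum matching in Python; the vocabulary and the maximum word length are made up for the example, and unmatched characters fall back to single-character tokens.

```python
def max_match(text, vocabulary, max_word_len=8):
    """Forward maximum matching: repeatedly take the longest vocabulary
    word that starts at the current position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches;
        # fall back to a single character if nothing in the vocabulary fits.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in vocabulary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

vocabulary = {"we", "can", "only", "see"}
print(max_match("wecanonlysee", vocabulary))  # ['we', 'can', 'only', 'see']
```
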
Porter's Algorithm is the most common English stemmer: it reduces a word to its stem. 
(Note: the implementation does not include all of the steps, so it may not work for every word.)
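
For a flavor of how the suffix-stripping steps look, here is a tiny sketch of just the step-1a rules of the Porter stemmer; the real algorithm has many more steps and measure-based conditions:

```python
def stem_step_1a(word):
    """Porter stemmer, step 1a only: apply the first suffix rule
    that matches, in order."""
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(stem_step_1a("ponies"))  # poni
```
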
Binary Classifier refers to a class of algorithms that look at the two possible decisions 
and choose the more likely one. 
The implementation is a binary classifier that decides whether a period ends a sentence or not.
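
The sketch below shows the shape of such a decision using a few hand-written features; the abbreviation list and the features are illustrative, not the ones used in the repository's implementation.

```python
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}  # illustrative list

def period_ends_sentence(text, idx):
    """Decide whether the period at text[idx] ends a sentence, using a
    few hand-written features as a stand-in for a learned classifier."""
    tokens_before = text[:idx].split()
    before = tokens_before[-1].lower() if tokens_before else ""
    after = text[idx + 1:].lstrip()
    if before.rstrip(".") in ABBREVIATIONS:
        return False  # e.g. the period in "Dr." rarely ends a sentence
    if before.isdigit():
        return False  # decimal point, e.g. the first period in "3.14"
    if after and after[0].isupper():
        return True   # a following capitalized word suggests a new sentence
    return after == ""  # a period at the end of the text ends the sentence
```
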
Minimum Edit Distance is an algorithm that measures the similarity between two words 
as the minimum number of edit operations needed to turn one into the other.
Some of its applications are correcting typos and recommending similar words. 
There are various scoring systems depending on the task. The default scoring 
system is Levenshtein.
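
A minimal dynamic-programming sketch of Levenshtein distance, assuming the common convention that insertions, deletions, and substitutions all cost 1:

```python
def levenshtein(a, b):
    """Minimum edit distance with the Levenshtein scoring system."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```
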
The Needleman-Wunsch algorithm maximizes the similarity between two gene sequences 
(strings) to find the optimal global alignment. Every character (nucleotide) takes part 
in the alignment: each one is matched, mismatched, or aligned with a gap.
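
A sketch of the Needleman-Wunsch scoring recurrence; the weights (match +1, mismatch -1, gap -1) are illustrative defaults, and only the score is returned, not the traceback:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via Needleman-Wunsch."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                  # match/mismatch
                              score[i - 1][j] + gap,  # gap in b
                              score[i][j - 1] + gap)  # gap in a
    return score[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with these weights
```
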
The Smith-Waterman algorithm is used in NLP and bioinformatics to find the best local 
alignments between two sequences. It fills a scoring matrix and traces back from the 
highest-scoring cell to recover an alignment. Unlike global alignment, 
some characters (nucleotides) may be left out of the alignment.
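
A sketch of the Smith-Waterman recurrence; compared with Needleman-Wunsch, cells are floored at 0 and the answer is the best cell anywhere in the matrix. The weights are illustrative and the traceback is omitted:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment score via Smith-Waterman: like Needleman-Wunsch,
    but cells never go below 0, so low-scoring regions are dropped."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best

print(smith_waterman("TAGCT", "AGC"))  # 6: the substring 'AGC' aligns locally
```
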
N-grams are a class of probabilistic language models that compute the probability of 
a given word sequence or sentence. Its applications are, e.g., machine translation, 
speech recognition, and so on. These models use the chain rule of probability and the 
Markov assumption, with a chosen N, where N is a positive integer. Some of the most 
common N-gram models are unigrams (N=1), bigrams (N=2), and trigrams (N=3). 
The models differ in the size of the word groups they use: unigrams treat words 
individually, bigrams use pairs of words, and trigrams use triples. A unigram model 
computes the probability of each word on its own with the Maximum Likelihood Estimate.
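
A minimal sketch of the bigram (N=2) Maximum Likelihood Estimate, P(w2 | w1) = count(w1 w2) / count(w1); the toy corpus is the classic "I am Sam" example, and the <s>/</s> tokens mark sentence boundaries:

```python
from collections import Counter

def bigram_probability(sentences, w1, w2):
    """MLE bigram estimate: P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

corpus = ["i am sam", "sam i am", "i do not like green eggs and ham"]
print(bigram_probability(corpus, "i", "am"))  # 2/3 on this toy corpus
```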