An NLP project that leverages Levenshtein distance to spell-correct text in a corpus. Tokenization and de-tokenization use RegEx.
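A delimiter-preserving regex split is one common way to tokenize so that the text can be reassembled exactly afterwards. The sketch below is a hypothetical illustration of that idea (the function names and the delimiter set are assumptions, not the project's actual implementation):

```python
import re

def tokenize(text):
    # A capturing group in re.split keeps the delimiters in the
    # result, so nothing is lost between tokenization and
    # de-tokenization. The delimiter set here is an assumption.
    return [t for t in re.split(r"([ \t\n.,;:!?]+)", text) if t]

def detokenize(tokens):
    # Joining the interleaved tokens and delimiters reproduces
    # the original string exactly.
    return "".join(tokens)
```

Because the delimiters are retained, `detokenize(tokenize(text)) == text` holds for any input, which lets the corrector swap out word tokens while leaving spacing and punctuation untouched.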
Run `make install` to install the only dependency: NumPy.
`random_text.txt` contains the corrupted body of text as a single string.
`english_words_list.txt` contains the dictionary of words to correct against. The list used here holds 10,000 frequently used words in descending order of frequency.
`spelling_corrector.py` uses Levenshtein distance to correct the spellings of words in `random_text.txt`.
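For reference, Levenshtein distance is usually computed with the classic dynamic-programming recurrence, and a corrector can pick the dictionary word with the smallest distance. This is a minimal sketch of that general approach, not the script's actual code (the `correct` helper and its tie-breaking rule are assumptions):

```python
def levenshtein(a, b):
    # Classic DP edit distance, O(len(a) * len(b)) time,
    # keeping only one previous row in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def correct(word, dictionary):
    # min() returns the first minimum, so with a frequency-ordered
    # dictionary, ties go to the more frequent word.
    return min(dictionary, key=lambda w: levenshtein(word, w))
```

Since the word list is sorted by descending frequency, letting `min` break ties in list order is a cheap way to prefer common words.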
SpellingCorrector works well:
- for cases involving multiple delimiter types and/or combinations of delimiters.
- at preserving the uppercase, titlecase, and numeric characteristics of each token.
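Preserving a token's casing typically means recording the original pattern and re-applying it to the corrected word. A small sketch of that idea (the helper name is an assumption, not the project's API):

```python
def match_case(template, word):
    # Re-apply the casing pattern of the original token
    # to the corrected word.
    if template.isupper():
        return word.upper()
    if template.istitle():
        return word.title()
    return word.lower()
```

For example, if the corrupted token was `"Wrld"` and the correction is `"world"`, the output stays titlecased as `"World"`.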
With some limitations:
- Speed: every token is compared against the whole dictionary, so it is slow on large inputs. Feel free to work out the big-O yourself 😛
- Unaccounted edge cases include:
  - contractions
  - plurals
  - non-ASCII characters
  - corrupted numbers
  - words misspelled into a different existing word (e.g. "in" corrupted to "i")