An NLP project that leverages Levenshtein distance to spell-correct text in a corpus. Tokenization and de-tokenization use RegEx.
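A delimiter-preserving regex split is one common way to tokenize so that the text can be reassembled exactly afterwards. The sketch below is a hypothetical illustration of that idea (the function names and the delimiter set are assumptions, not the project's actual implementation):

```python
import re

def tokenize(text):
    # A capturing group in re.split keeps the delimiters in the
    # result, so nothing is lost between tokenization and
    # de-tokenization. The delimiter set here is an assumption.
    return [t for t in re.split(r"([ \t\n.,;:!?]+)", text) if t]

def detokenize(tokens):
    # Joining the interleaved tokens and delimiters reproduces
    # the original string exactly.
    return "".join(tokens)
```

Because the delimiters are retained, `detokenize(tokenize(text)) == text` holds for any input, which lets the corrector swap out word tokens while leaving spacing and punctuation untouched.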
Run `make install` to install the only dependency: NumPy.
`random_text.txt` contains the corrupted body of text as a single string.
`english_words_list.txt` contains the dictionary of words to correct against. The list used here holds 10,000 frequently used words in descending order of frequency.
`spelling_corrector.py` uses Levenshtein distance to correct the spellings of words in `random_text.txt`.
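For reference, Levenshtein distance is usually computed with the classic dynamic-programming recurrence, and a corrector can pick the dictionary word with the smallest distance. This is a minimal sketch of that general approach, not the script's actual code (the `correct` helper and its tie-breaking rule are assumptions):

```python
def levenshtein(a, b):
    # Classic DP edit distance, O(len(a) * len(b)) time,
    # keeping only one previous row in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def correct(word, dictionary):
    # min() returns the first minimum, so with a frequency-ordered
    # dictionary, ties go to the more frequent word.
    return min(dictionary, key=lambda w: levenshtein(word, w))
```

Since the word list is sorted by descending frequency, letting `min` break ties in list order is a cheap way to prefer common words.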
SpellingCorrector works well:
- for cases involving multiple delimiter types and/or combinations of delimiters.
- at preserving the uppercase, titlecase, and numeric characteristics of each token.
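Preserving a token's casing typically means recording the original pattern and re-applying it to the corrected word. A small sketch of that idea (the helper name is an assumption, not the project's API):

```python
def match_case(template, word):
    # Re-apply the casing pattern of the original token
    # to the corrected word.
    if template.isupper():
        return word.upper()
    if template.istitle():
        return word.title()
    return word.lower()
```

For example, if the corrupted token was `"Wrld"` and the correction is `"world"`, the output stays titlecased as `"World"`.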
With some limitations:
- Speed: every token is compared against the whole dictionary, so it is slow on large inputs. Feel free to work out the big-O yourself 😛
- Unaccounted edge cases include:
  - contractions
  - plurals
  - non-ASCII characters
  - corrupted numbers
  - words misspelled into a different existing word (e.g. "in" corrupted to "i")