
SpellingCorrector

NLP project that leverages Levenshtein distance to spell-correct text in a corpus. Tokenization and de-tokenization use RegEx.

Simple Setup

Run make install to install the only package we need: NumPy.
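If you would rather not use make, the install target presumably boils down to a single pip command (an assumption — check the Makefile for the exact recipe):

```shell
# Equivalent of `make install` (assumed): pull in NumPy, the sole dependency.
pip install numpy
```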

How it Works

random_text.txt contains the corrupted body of text as a single string.

english_words_list.txt contains the dictionary of valid words to correct against. The list used here holds 10,000 frequently used English words in descending order of frequency.

spelling_corrector.py uses Levenshtein distance to correct the spelling of words in random_text.txt.
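The core idea can be sketched as follows — this is a minimal, hypothetical version of the approach, not the repo's exact code. It computes Levenshtein distance with dynamic programming and picks the dictionary word with the smallest distance to each token; since the word list is sorted by descending frequency, ties naturally resolve to the more frequent word.

```python
# Hypothetical sketch of the approach (not spelling_corrector.py itself).
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b via dynamic programming."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)  # cost of deleting all of a
    d[0, :] = np.arange(len(b) + 1)  # cost of inserting all of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return int(d[len(a), len(b)])

def correct(token: str, dictionary: list[str]) -> str:
    """Return the dictionary word closest to token. min() keeps the
    first minimum, so ties go to the earlier (more frequent) entry."""
    return min(dictionary, key=lambda w: levenshtein(token, w))
```

For example, correct("helo", dictionary) returns "hello" when "hello" precedes other distance-1 candidates such as "help" in the frequency-sorted list.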

In Action

Corrupt text: [screenshot: Screen Shot 2021-10-13 at 12 13 19 AM]

Corrected text: [screenshot: Screen Shot 2021-10-13 at 12 09 26 AM]

Highlights & Limitations

SpellingCorrector works well:

  • for cases involving multiple delimiter types and/or combinations of delimiters.
  • at preserving the uppercase, titlecase, and numeric characteristics of each token.
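The two strengths above can be illustrated with a small sketch — the regex and helper names here are assumptions, not the repo's actual code. The pattern captures delimiter runs as their own tokens, so joining the tokens reproduces the original string exactly (de-tokenization), and a case-matching helper restores the original token's uppercase or titlecase shape after correction.

```python
import re

def tokenize(text: str) -> list[str]:
    # Alternate word tokens and delimiter runs; keeping both means
    # "".join(tokens) reconstructs the original string exactly.
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

def match_case(corrected: str, original: str) -> str:
    # Transfer the original token's case shape onto the correction.
    if original.isupper():
        return corrected.upper()
    if original.istitle():
        return corrected.title()
    return corrected
```

For example, tokenize("Helo, wrld!") yields ["Helo", ", ", "wrld", "!"], and match_case("hello", "Helo") yields "Hello".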

With some limitations:

  • Speed: feel free to work out the big-O complexity yourself 😛
  • Unhandled edge cases include:
    • contractions
    • plurals
    • non-ASCII characters
    • corrupted numbers
    • words misspelled into a different existing word (e.g. "in" corrupted to "i")