zverok/spylls

Using spylls to clean up a text file

shantanuo opened this issue · 1 comment

Is it possible to run spylls against a large corpus and remove all misspelled words?
Something like what was asked here:
https://stackoverflow.com/questions/65785287/using-hunspell-to-find-incorrect-words-in-jamspell

It is kind of possible, but spylls may not be the best tool for the task, and you'll need to write some Python :)
At the highest level, the code will look like this:

  1. Load your corpus
  2. Tokenize it into words (with some existing Python tokenization library)
  3. Check each word with spylls' Dictionary.lookup() method
  4. Drop the words for which it returns False
  5. Save the filtered corpus
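
The steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the regex tokenizer, the helper names `filter_misspelled` and `clean_file`, and the file paths are assumptions made for the example. Only `Dictionary.from_files()` and `Dictionary.lookup()` come from spylls. A real corpus will probably need a proper tokenization library and some thought about proper nouns, numbers and casing.

```python
import re
from typing import Callable

# Assumed tokenizer: runs of Unicode letters, optionally joined by an
# apostrophe or hyphen ("don't", "well-known"). Swap in a real tokenizer
# if your corpus needs one.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def filter_misspelled(text: str, is_known: Callable[[str], bool]) -> str:
    """Return text with every word rejected by is_known() removed."""
    def keep_or_drop(match: re.Match) -> str:
        word = match.group(0)
        return word if is_known(word) else ""

    cleaned = WORD_RE.sub(keep_or_drop, text)
    # Collapse the runs of spaces/tabs left behind by removed words
    # (newlines are left untouched).
    return re.sub(r"[ \t]{2,}", " ", cleaned)


def clean_file(src_path: str, dst_path: str, dict_path: str) -> None:
    """Filter a corpus line by line so a large file never sits in memory."""
    # Third-party dependency (pip install spylls), imported lazily.
    from spylls.hunspell import Dictionary

    # dict_path is the dictionary path without extension, e.g. "en_US"
    # for en_US.aff / en_US.dic.
    dictionary = Dictionary.from_files(dict_path)
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(filter_misspelled(line, dictionary.lookup))
```

Calling `clean_file("corpus.txt", "corpus.clean.txt", "en_US")` (hypothetical file names) would write the filtered copy. Passing the checker in as a callable also makes it easy to swap spylls for another spell checker later.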

You can probably do the same with hunspell (the command-line tool or a Python wrapper), and it will likely be faster, since spylls is a pure-Python port.