Using spylls to clean-up text file
shantanuo opened this issue · 1 comments
shantanuo commented
Is it possible to run spylls against a large corpus and remove all misspelled words?
Something like what was asked here:
https://stackoverflow.com/questions/65785287/using-hunspell-to-find-incorrect-words-in-jamspell
zverok commented
It is kind of possible, but spylls may not be the best tool for the task, and you'll need to write some Python :)
On the highest level, the code will look like:
- Load your corpus
- Tokenize it into words (with some existing Python tokenization library)
- Check each word with spylls' `Dictionary.lookup()` method
- Drop those which are `False`
- ...save the filtered corpus.
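
A minimal sketch of those steps, under a few assumptions: the regex tokenizer is only a stand-in for a real tokenization library, `drop_unknown_words` is a hypothetical helper name, and the file names in the comments are placeholders. The only spylls calls used are `Dictionary.from_files()` and `Dictionary.lookup()`; they appear in the trailing comments so the helper itself can be tried without spylls installed.

```python
import re
from typing import Callable

# Stand-in tokenizer (an assumption, not part of spylls): runs of letters,
# optionally joined by apostrophes, e.g. "don't".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def drop_unknown_words(text: str, is_known: Callable[[str], bool]) -> str:
    """Return text with every word removed for which is_known(word) is False."""

    def keep_or_drop(match: "re.Match[str]") -> str:
        word = match.group(0)
        return word if is_known(word) else ""

    cleaned = WORD_RE.sub(keep_or_drop, text)
    # Collapse the runs of spaces left behind by dropped words and
    # trim whitespace at both ends of the text.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


# Usage with spylls (requires `pip install spylls`; the dictionary path is an
# assumption and must point at en_US.aff / en_US.dic without the extension):
#
#   from spylls.hunspell import Dictionary
#   dictionary = Dictionary.from_files("en_US")
#   with open("corpus.txt", encoding="utf-8") as src:
#       filtered = drop_unknown_words(src.read(), dictionary.lookup)
#   with open("corpus.filtered.txt", "w", encoding="utf-8") as dst:
#       dst.write(filtered)
```

Passing the lookup in as a plain callable keeps the filtering logic independent of the spell checker, so the same helper should work with a hunspell wrapper if spylls turns out to be too slow for a large corpus.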
Probably you can do the same with hunspell (command-line tool or Python wrapper) and it will be more performant...