Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Autocorrect

Spelling corrector in Python. It currently supports English, Polish, Turkish, Russian, Ukrainian and Spanish, but you can easily add new languages.

Based on: https://github.com/phatpiglet/autocorrect

Installation

pip install autocorrect

Examples

>>> from autocorrect import Speller
>>> spell = Speller()
>>> spell("I'm not sleapy and tehre is no place I'm giong to.")
"I'm not sleepy and there is no place I'm going to."

>>> spell = Speller(lang='pl')
>>> spell('ptaaki latatją kluczmm')
'ptaki latają kluczem'

Adding new languages

First, add the language's special letters to autocorrect/constants.py.

Next, you need a large text corpus. The easiest way to get one is to download a Wikipedia dump. For example, for Spanish go to https://dumps.wikimedia.org/eswiki/latest/ and download eswiki-latest-pages-articles.xml.bz2, then decompress it:

bzip2 -d eswiki-latest-pages-articles.xml.bz2

After that:

>>> from autocorrect.word_count import count_words
>>> count_words('eswiki-latest-pages-articles.xml', 'es')

Then compress the resulting word_count.json from the shell:

tar -zcvf autocorrect/data/es.tar.gz word_count.json
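
If you would rather stay in Python for the whole workflow, the two shell steps can be sketched with the standard library alone. This is only a sketch: the helper names decompress_bz2 and pack_word_counts are hypothetical and not part of autocorrect, and it assumes the archive should contain a single member named word_count.json, as the tar command above suggests.

```python
import bz2
import shutil
import tarfile


def decompress_bz2(src_path, dst_path):
    """Stream-decompress a .bz2 file (like `bzip2 -d`, but the source file is kept)."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Copy in chunks so a multi-gigabyte dump never has to fit in memory.
        shutil.copyfileobj(src, dst)


def pack_word_counts(json_path, archive_path):
    """Store json_path in a gzip-compressed tar archive under the name word_count.json."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(json_path, arcname="word_count.json")


# Intended usage (not run here; needs the dump and the autocorrect package):
# decompress_bz2("eswiki-latest-pages-articles.xml.bz2",
#                "eswiki-latest-pages-articles.xml")
# from autocorrect.word_count import count_words
# count_words("eswiki-latest-pages-articles.xml", "es")
# pack_word_counts("word_count.json", "autocorrect/data/es.tar.gz")
```

Unlike bzip2 -d, this keeps the compressed dump on disk, so make sure there is room for both files.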

Speed

%timeit spell("I'm not sleapy and tehre is no place I'm giong to.")
410 µs ± 6.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit spell("There is no comin to consiousnes without pain.")
186 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
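
The %timeit lines above only work inside IPython. To take similar measurements from a plain script, something like the following sketch could be used. The helper name time_call is hypothetical, and it reports the best per-call time rather than the mean ± std. dev. that %timeit prints, so the numbers are not directly comparable.

```python
import timeit


def time_call(func, text, number=100, repeat=7):
    """Return the best observed per-call time, in seconds, for func(text)."""
    timer = timeit.Timer(lambda: func(text))
    # repeat() returns one total time per run; each run calls func `number` times.
    runs = timer.repeat(repeat=repeat, number=number)
    return min(runs) / number


# Intended usage (not run here; needs the autocorrect package):
# from autocorrect import Speller
# spell = Speller()
# print(time_call(spell, "I'm not sleapy and tehre is no place I'm giong to."))
```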

Contribute

https://github.com/fsondej/autocorrect

Todo

  • some words are corrected to implausible versions (see english2 in unit_tests)
  • Python 2 doesn't support correction with Polish special characters
  • add an option to disable double typos for speed
  • it looks like loading spellers multiple times may be leaking memory
  • in double typos we check the same words twice
  • clean the repo: https://stackoverflow.com/questions/2116778/reduce-git-repository-size
  • maybe use Git LFS