PyThaiNLP/pythainlp

Spell-Correct: Probability of all corrected words are the same

ming-o-0 opened this issue · 3 comments

In pythainlp/pythainlp/spell/pn.py, you say the code is forked from http://norvig.com/spell-correct.html. As far as I understand, you use the same implementation as in the link.

In your code, you import "WORDS" from a dictionary, whereas the code at the link above uses a corpus (big.txt). As a result, all corrected words get the same probability, because every word in a dictionary appears only once. The idea behind the original code is to choose the most frequent word in the corpus.

The fix is to build "WORDS" from a large corpus instead of the dictionary.
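
To illustrate the point above, here is a minimal sketch of a Norvig-style word probability (count divided by total count). It does not use PyThaiNLP; the `prob` helper and the toy word lists are made up for illustration only.

```python
from collections import Counter

def prob(word, counts):
    """Relative frequency of `word`: its count divided by the total count."""
    total = sum(counts.values())
    return counts[word] / total  # a Counter returns 0 for unseen words

# Toy data, invented for illustration (not taken from any real corpus).
dictionary_words = ["กิน", "ข้าว", "กัน"]  # a dictionary lists each word once
corpus_tokens = ["กัน", "กัน", "กัน", "กิน", "กิน", "ข้าว"]  # a corpus repeats words

dict_counts = Counter(dictionary_words)
corpus_counts = Counter(corpus_tokens)

# With a dictionary, every known word has count 1, so all probabilities tie.
dict_probs = {w: prob(w, dict_counts) for w in dictionary_words}

# With a corpus, more frequent words get higher probabilities.
corpus_probs = {w: prob(w, corpus_counts) for w in dictionary_words}

print(dict_probs)
print(corpus_probs)
```

With dictionary counts, the probability carries no ranking information, so it cannot separate a likely correction from an unlikely one.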

bact commented

Observation

I can confirm @MingPawat's observation. Below is the result from PyThaiNLP 1.7.0.1:

>>> from pythainlp.spell import pn
>>> pn.prob("กิน")
1.9348347651110595e-05
>>> pn.prob("ข้าว")
1.9348347651110595e-05
>>> pn.prob("กัน")
1.9348347651110595e-05
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0

All words that are included in the dictionary have a probability of 1.9348347651110595e-05;
everything else gets 0.0.

I tried using word frequencies from the Thai National Corpus (TNC) instead
(from pythainlp.corpus import tnc -- these are pre-computed counts, not the actual corpus text),
by replacing

WORDS = Counter(thaiword.get_data())

with

WORDS = Counter(dict(tnc.get_word_frequency_all()))

Here is the result:

>>> pn.prob("กิน")
0.0006138412282452856
>>> pn.prob("ข้าว")
0.00026716049573969757
>>> pn.prob("กัน")
0.003979265980548341
>>> pn.prob("กงง")
0.0
>>> pn.prob("ภาษาไท")
0.0
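
The effect of the replacement can be sketched without PyThaiNLP. This assumes the frequency list is an iterable of (word, count) pairs, as the dict(...) call above suggests; the pairs, the counts, and the `rank` helper below are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical (word, count) pairs standing in for a pre-counted frequency list.
# The numbers are invented and are not real TNC counts.
word_freqs = [("กัน", 40), ("กิน", 6), ("ข้าว", 3)]

# Counter accepts a mapping, so the pre-computed counts are used directly.
WORDS = Counter(dict(word_freqs))
TOTAL = sum(WORDS.values())

def prob(word):
    """Relative frequency of `word`; 0.0 for words missing from WORDS."""
    return WORDS[word] / TOTAL

def rank(candidates):
    """Keep only known candidates and order them from most to least frequent."""
    known = [w for w in candidates if w in WORDS]
    return sorted(known, key=prob, reverse=True)

ranked = rank(["ข้าว", "กงง", "กัน", "กิน"])
print(ranked)
```

Because the counts now differ between words, sorting by probability produces a meaningful order, and unknown strings such as "กงง" are dropped.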

Difference in spelling check end result

Original spell checker (using thaiword.txt):

>>> pythainlp.spell("เหลีนม")
['เหลิม', 'เหลียน', 'เหลือม', 'เหลน', 'เหลียน', 'เลียม', 'เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม', 'เหลียน', 'เหลน', 'เหลิม', 'เหลี่ยม', 'เหลียว', 'เหลือม', 'เหนียม', 'เหลียน', 'เหลียน', 'เหลี่ยม', 'เหลี่ยม', 'เหลิม', 'เหลิม']
>>> pythainlp.spell("เหลียม")
['เหลียว', 'เหนียม', 'เหลี่ยม', 'เหลียน', 'เลียม']

Modified spell checker (using TNC word frequency):

>>> pythainlp.spell("เหลีนม")
['เหลียม']
>>> pythainlp.spell("เหลียม")
['เหลียม']

This is mainly because thaiword.txt does not contain the word "เหลียม", but TNC does.

Other tests with TNC:

>>> pythainlp.spell("กกฎาคม")
['กรกฎาคม']
>>> pythainlp.spell("อนุญาติ")
['อนุญาต']
>>> pythainlp.spell("กิเลย")
['กิเลน', 'กิเลส']
>>> pythainlp.spell("สัตค์")
['สัตว์', 'สัตย์', 'สัตร์', 'สัตถ์']

From a quick manual check (by me), the order of the suggestions looks reasonable.

Possible problem with "real world" examples

The problem with using text from a corpus (like TNC) is that if the corpus itself contains a misspelled word, the spell checker may suggest that misspelled word. This needs to be investigated as well.
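
A minimal sketch of this concern, assuming Norvig-style behaviour where a word already present in the counts is accepted as is. The `correction` helper, its `candidates` parameter, and the counts are hypothetical simplifications, not the PyThaiNLP implementation.

```python
from collections import Counter

def correction(word, counts, candidates):
    """Simplified Norvig-style choice: accept known words, else pick the
    most frequent known candidate, else return the input unchanged."""
    if word in counts:
        return word
    known = [c for c in candidates if c in counts]
    return max(known, key=counts.get) if known else word

# Invented counts: the second set includes the misspelling "อนุญาติ".
clean_counts = Counter({"อนุญาต": 50})
noisy_counts = Counter({"อนุญาต": 50, "อนุญาติ": 4})

fixed = correction("อนุญาติ", clean_counts, ["อนุญาต"])
kept = correction("อนุญาติ", noisy_counts, ["อนุญาต"])

print(fixed)  # corrected, because the misspelling is unknown
print(kept)   # left as is, because the noisy counts contain the misspelling
```

Once a misspelling appears in the frequency data, it is treated as a valid word, however rare it is compared with the correct form.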

bact commented

@MingPawat I have opened pull request #137 to fix this based on your suggestion. If you have time, please review whether it works correctly. Thank you.

bact commented

Fixed with #137