Bad case caused by no-consonant char
In your trained consonantMap_TwoDCode, a missing consonant maps to (99999.0, 99999.0). As a result, chars with no consonant, such as "我" and "一", are never similar to any char that has a consonant (e.g. "过", "鸡"). Does this make sense?
@marina-danilevsky could you please take a look at this and answer it if possible? I only translated the trained model; the algorithm part is a black box to me.
I'm sorry, I'm not at all sure how to answer this (I do not know Chinese, and my co-author who does, and did some of this implementation originally, left the company some time ago). Could you possibly be more specific? I can't quite tell if this is a bug in the mapping generally, or something you're observing with specific input.
import dimsim
dimsim.get_distance("我", "火")
67339.46237343867
dimsim.get_distance("果", "火")
1.1904761904761905
@marinadanilevsky In the example above, "我", "果" and "火" have similar pronunciations but get very different distances in dimsim. This is because "我" has no consonant and so maps to (99999.0, 99999.0), while "火" and "果" have explicit consonants that map to (7.0, 3.0) and (7.0, 0.5).
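To make the effect concrete, here is a minimal sketch that uses only the three coordinates quoted above. It is not dimsim's actual distance function: it assumes a plain Euclidean distance on the consonant coordinates, and the names `CONSONANT_COORDS` and `consonant_distance` are made up for illustration. The values dimsim reports (67339.46... and 1.19...) differ from what this prints, so the library clearly does more than this, but the sketch shows how the sentinel coordinate swamps everything else.

```python
import math

# Coordinates quoted in this thread. The dict name and layout are
# hypothetical, not dimsim's real data structure.
NO_CONSONANT = (99999.0, 99999.0)  # sentinel used when a syllable has no consonant
CONSONANT_COORDS = {
    "我": NO_CONSONANT,
    "火": (7.0, 3.0),
    "果": (7.0, 0.5),
}

def consonant_distance(a, b):
    """Plain Euclidean distance between two chars' consonant coordinates
    (an assumption for illustration, not dimsim's formula)."""
    (x1, y1), (x2, y2) = CONSONANT_COORDS[a], CONSONANT_COORDS[b]
    return math.hypot(x1 - x2, y1 - y2)

# Two chars with explicit consonants stay close together.
print(consonant_distance("果", "火"))
# Any pair involving the sentinel is dominated by the 99999.0 offset.
print(consonant_distance("我", "火"))
```

Under this assumption, any char without a consonant ends up far from every char that has one, regardless of how close the rest of the pronunciation is.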
To fix this, edit lines 58 and 64 of the pinyinRewrite method in dimsim's utils/pinyin.py: just comment out the assignment there, i.e. # self.consonant = ""