mozillazg/phrase-pinyin-data

Why isn't large_pinyin.txt used in pypinyin?

Jackiexiao opened this issue · 4 comments

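Is it because large_pinyin.txt is not very accurate?

Also, I think switching to the yi1 yi2 yi3 yi4 yi5 form would make the dictionary easier to maintain. I wrote a conversion script while I was at it. I wasn't sure where a PR for it would belong, so I'm posting it here for now /facepalm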

PHONETIC_SYMBOL_DICT = {
    "ā": "a1",
    "á": "a2",
    "ǎ": "a3",
    "à": "a4",
    "ē": "e1",
    "é": "e2",
    "ě": "e3",
    "è": "e4",
    "ō": "o1",
    "ó": "o2",
    "ǒ": "o3",
    "ò": "o4",
    "ī": "i1",
    "í": "i2",
    "ǐ": "i3",
    "ì": "i4",
    "ū": "u1",
    "ú": "u2",
    "ǔ": "u3",
    "ù": "u4",
    # üe
    "ü": "v",
    "ǖ": "v1",
    "ǘ": "v2",
    "ǚ": "v3",
    "ǜ": "v4",
    "ń": "n2",
    "ň": "n3",
    "ǹ": "n4",
    "\u1e3f": "m2"
}

def _get_keys(adict, value):
    """Return the first key in adict whose value equals value."""
    return [k for k, v in adict.items() if v == value][0]


def replace_number_to_symbol(pinyin):
    """Convert a numbered syllable to a tone-marked one, e.g. yang1 -> yāng."""
    if not pinyin or not pinyin[-1].isdigit():
        return pinyin
    tone = pinyin[-1]
    pinyin = pinyin[:-1]
    # Digits other than 1-4 are treated as the neutral tone: drop the digit.
    if tone not in '1234':
        return pinyin.replace('v', 'ü')
    # Mark placement: a, o, e take priority; in "ui"/"iu" the second vowel
    # carries the mark; otherwise the lone i, u or v; syllabic n/m come last.
    if 'ui' in pinyin:
        candidates = ['a', 'o', 'e', 'i']
    elif 'iu' in pinyin:
        candidates = ['a', 'o', 'e', 'u']
    else:
        candidates = ['a', 'o', 'e', 'i', 'u', 'v', 'n', 'm']
    for letter in candidates:
        number = letter + tone
        if letter in pinyin and number in PHONETIC_SYMBOL_DICT.values():
            symbol = _get_keys(PHONETIC_SYMBOL_DICT, number)
            # Mark only the first occurrence; any unmarked v becomes ü.
            return pinyin.replace(letter, symbol, 1).replace('v', 'ü')
    # No letter can carry this tone (e.g. m4 is not in the table).
    return pinyin.replace('v', 'ü')
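
For the opposite direction (tone marks to digits, the format proposed above), a minimal sketch might look like the following. The names `TONE_MARKS`, `MARK_TO_PLAIN` and `symbol_to_number` are hypothetical and not part of pypinyin. It assumes one syllable per call and uses 5 for the neutral tone, following the yi5 convention mentioned above.

```python
import unicodedata

# Hypothetical table: plain letter -> its four tone-marked forms (tones 1-4).
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "o": "ōóǒò",
    "i": "īíǐì", "u": "ūúǔù", "v": "ǖǘǚǜ",
}

# Marked character -> (plain letter, tone digit).
MARK_TO_PLAIN = {
    mark: (letter, str(tone))
    for letter, marks in TONE_MARKS.items()
    for tone, mark in enumerate(marks, start=1)
}
# Syllabic nasals, matching the entries in the dictionary posted above.
MARK_TO_PLAIN.update({
    "ń": ("n", "2"), "ň": ("n", "3"), "ǹ": ("n", "4"), "\u1e3f": ("m", "2"),
})


def symbol_to_number(pinyin, neutral="5"):
    """Convert one tone-marked syllable to numbered form, e.g. yāng -> yang1."""
    # Normalize so that letter + combining accent becomes a single character.
    pinyin = unicodedata.normalize("NFC", pinyin)
    tone = neutral
    letters = []
    for ch in pinyin:
        if ch in MARK_TO_PLAIN:
            plain, tone = MARK_TO_PLAIN[ch]
            letters.append(plain)
        elif ch == "ü":
            letters.append("v")  # unmarked ü is written as v
        else:
            letters.append(ch)
    return "".join(letters) + tone
```

Syllables without any tone mark get the neutral digit appended, so `symbol_to_number("de")` gives `de5`.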

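@Jackiexiao Thanks for sharing!

Yes, large_pinyin.txt is fairly inaccurate, but the main reason is that it is so large that loading it would make the program use too much memory, which does not suit every user.

The dictionary uses tone marks because tone-marked pinyin is the most accurate form (Chinese dictionaries all use it). With digits, converting back to tone marks can be awkward, and in some cases the correct tone-marked pinyin cannot be recovered.

In which cases can the correct tone-marked pinyin not be recovered? Also, pypinyin's forward maximum matching segmentation has a bug: it segments by dictionary prefixes, which gives inaccurate results. For example, suppose the dictionary contains the two words "在一起" and "一片". Segmenting "在一片" yields "在一/片", so the pinyin annotation is wrong as well (assuming "一" is annotated according to the tone sandhi rules: yi1 yi2 yi4).

@Jackiexiao Prefix-based segmentation was chosen at the time to save memory, and this problem was not considered. See mozillazg/python-pinyin#81 for details. If you have a better segmentation approach, please share the idea, and PRs are welcome.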
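
The difference can be illustrated with a small sketch. Both functions below are hypothetical: `prefix_segment` is only a reconstruction of the behavior described in this comment, not pypinyin's actual code, and `fmm_segment` is a plain forward maximum matching that accepts a multi-character piece only when it is a complete dictionary word.

```python
def prefix_segment(text, words):
    """Greedy segmentation that extends a piece while it is a prefix of any word."""
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    result = []
    i = 0
    while i < len(text):
        j = i + 1
        # Keep growing the piece as long as it is still a dictionary prefix.
        while j < len(text) and text[i:j + 1] in prefixes:
            j += 1
        result.append(text[i:j])
        i = j
    return result


def fmm_segment(text, words):
    """Forward maximum matching: take the longest complete word at each position."""
    max_len = max(map(len, words), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                result.append(piece)
                i += size
                break
    return result


words = {"在一起", "一片"}
print(prefix_segment("在一片", words))  # ['在一', '片']
print(fmm_segment("在一片", words))     # ['在', '一片']
```

A prefix set is cheap to build but cannot tell "在一" (only a prefix) from a real word. Checking whole words avoids that, at the cost of retrying shorter candidates at each position.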