Statistics-Based Unsupervised Chinese Word Segmentation
Original article: http://www.matrix67.com/blog/archives/5044
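The method assumes that a character string forms a word when it satisfies two conditions:
- Cohesion (solid degree)
  - For example, the cohesion of "键盘" (keyboard):
    $\mathrm{solid} = \frac{\mathrm{freq}(键盘)}{\mathrm{freq}(键) + \mathrm{freq}(盘)}$
- Freedom (degree of freedom)
  - The freedom of "键盘": the entropy of the characters that appear immediately to its left and the entropy of the characters that appear immediately to its right.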
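The two measures can be illustrated with a small, self-contained sketch. It is not the code in `_core.freq_calculator`: it assumes the probability-ratio form of cohesion described in the linked article (the probability of the word divided by the product of the probabilities of its two parts, minimized over every split point), which may differ from what `WordCut` actually computes. The function names and the toy sentence are made up for illustration.

```python
import math
from collections import Counter


def cohesion(word, ngram_freq, total):
    """Smallest P(word) / (P(left) * P(right)) over all two-part splits.

    Assumes every part of `word` has a non-zero count in `ngram_freq`.
    """
    p_word = ngram_freq[word] / total
    ratios = []
    for i in range(1, len(word)):
        p_left = ngram_freq[word[:i]] / total
        p_right = ngram_freq[word[i:]] / total
        ratios.append(p_word / (p_left * p_right))
    return min(ratios)


def entropy(neighbors):
    """Shannon entropy (in bits) of a list of neighboring characters."""
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def neighbor_entropies(word, text):
    """Return (left entropy, right entropy) of `word` within `text`."""
    left, right = [], []
    start = text.find(word)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(word)
        if end < len(text):
            right.append(text[end])
        start = text.find(word, start + 1)
    return entropy(left), entropy(right)


# Toy sentence (hypothetical, not from the corpus): "键盘" occurs three times,
# each time with a different left and right neighbor.
text = "新键盘很好用旧键盘也好用买键盘吧"
# Count every 1-character and 2-character substring.
ngram_freq = Counter(
    text[i:i + n] for n in (1, 2) for i in range(len(text) - n + 1)
)
print(cohesion("键盘", ngram_freq, total=len(text)))
print(neighbor_entropies("键盘", text))
```

A string that always appears with the same neighbor on one side gets an entropy of 0 on that side, which is the signal that it is probably a fragment of a longer word.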
Usage (example.py):
"""
Analyze the word segmentation results
"""
import pandas as pd
from _core.freq_calculator import WordCut, read_file
# Read the corpus
corpus = read_file(r'./corpus/Swordsman.txt')
# Remove extra whitespace
corpus = [''.join(x.split()) for x in corpus]
# Segment
word_cut = WordCut(min_freq=4)  # minimum word frequency of 4; lower values are slower
result = word_cut.cut(corpus)  # compute the statistics for each candidate word
# Save the results
result = pd.DataFrame(result, columns=word_cut.get_columns_name())
print(result.describe())
result.to_csv('result.csv', index=False, encoding='utf-8')
Segmentation statistics (the quantiles can be used to choose filtering thresholds):
WordCut information:
min_freq: 4
word_max_len: 11
count bin_word total : 715713
count bin_word unique: 132309
corpus solid degree : 0.815137
================ PdPrinter.print_full ================
             length      frequent  entropy_left  entropy_right   solid_degree
count  32417.000000  32417.000000  32417.000000   32417.000000   32417.000000
mean       2.736589     19.728877      2.648585       2.635318     296.195382
std        0.921086     71.071457      1.186383       1.227081    3067.525635
min        2.000000      5.000000      0.103275       0.117595       1.000280
25%        2.000000      6.000000      1.921928       1.921928       3.500381
50%        2.000000      8.000000      2.521641       2.521641      11.374101
75%        3.000000     15.000000      3.188722       3.250000      50.721759
max        9.000000   6782.000000     11.576797      10.649964  143142.600000
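One way to turn the quantiles into thresholds is sketched below. It assumes `result` is the DataFrame built in example.py with the column names shown in the table. The helper names, the choice of the 25% quantile, and the toy values are illustrative assumptions, not part of the project.

```python
import pandas as pd

SCORE_COLUMNS = ["entropy_left", "entropy_right", "solid_degree"]


def quantile_thresholds(result, q=0.25, columns=SCORE_COLUMNS):
    """Return the q-th quantile of each score column as a dict."""
    return {col: float(result[col].quantile(q)) for col in columns}


def keep_above(result, thresholds):
    """Keep rows whose score reaches the threshold in every column."""
    mask = pd.Series(True, index=result.index)
    for col, value in thresholds.items():
        mask &= result[col] >= value
    return result[mask]


# Toy values for illustration only; they are not taken from any corpus.
toy = pd.DataFrame({
    "word": ["a", "b", "c", "d", "e"],
    "entropy_left": [1.0, 2.0, 3.0, 4.0, 5.0],
    "entropy_right": [5.0, 4.0, 3.0, 2.0, 1.0],
    "solid_degree": [10.0, 20.0, 30.0, 40.0, 50.0],
})
thresholds = quantile_thresholds(toy, q=0.25)
kept = keep_above(toy, thresholds)
print(thresholds)
print(kept["word"].tolist())
```

A higher `q` gives a stricter filter. A different quantile per column may suit a real corpus better, since the distribution of solid_degree is far more skewed than that of the two entropies.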
Extraction examples (from 笑傲江湖, The Smiling, Proud Wanderer):
word,length,frequent,entropy_left,entropy_right,solid_degree
令狐,2,6782,11.576797175094372,0.8715216865031826,98.49217766897434
令狐冲,3,5954,11.478548750485759,7.275784432264426,93.99149100516193
什么,2,2109,5.662919848630054,8.3715346152597,209.5609075778746
山派,2,1640,2.060570055780612,6.1555117371002845,82.18053669255931
说道,2,1629,9.729354959628736,10.6499644658698,13.870566546245739
了一,2,1238,6.949119752779786,5.54763535954332,4.345274762356049
自己,2,1231,7.498273403721695,8.024101218827067,205.74226694103865
岳不群,3,1194,9.317047130639809,7.0761493108903295,208.09587600372572
一个,2,1186,7.201986921671323,7.556966059024818,15.445934945354473
弟子,2,1177,5.744631054003196,8.536645155111877,66.85190748279885
剑法,2,1139,5.544546538738371,8.29125603540392,74.29849169516365
也不,2,1126,7.254524392047882,5.978795970544172,10.598009026838083
师父,2,1108,7.20660380750861,8.427032495404294,107.9037923962667
不是,2,1103,5.682717786561449,8.182681365467001,3.6361191789122365
盈盈,2,1054,8.72553708261531,6.993663034228474,166.74181969612272
华山,2,1009,6.927067967420202,4.033540625085696,173.77673429756393
一声,2,996,3.5996330250417885,8.411177315225054,12.115111362128106
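The extracted words are then kept or discarded by filtering on each column. For entropy_left (left freedom), entropy_right (right freedom), and solid_degree (cohesion), higher is better.
For example, "令狐" has a very low right freedom (0.8715), so it can be filtered out.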
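The filtering step might look like the following sketch, which uses only the standard library and four rows copied from the extraction examples above. The function name and the threshold values (3.0 for both entropies, 50.0 for solid_degree) are hypothetical choices for illustration; suitable values depend on the corpus.

```python
import csv
import io

# Four rows copied from the extraction examples above.
SAMPLE = """word,length,frequent,entropy_left,entropy_right,solid_degree
令狐,2,6782,11.576797175094372,0.8715216865031826,98.49217766897434
令狐冲,3,5954,11.478548750485759,7.275784432264426,93.99149100516193
山派,2,1640,2.060570055780612,6.1555117371002845,82.18053669255931
岳不群,3,1194,9.317047130639809,7.0761493108903295,208.09587600372572
"""


def filter_words(rows, min_entropy, min_solid):
    """Keep words whose two entropies and cohesion all reach the thresholds."""
    kept = []
    for row in rows:
        if (float(row["entropy_left"]) >= min_entropy
                and float(row["entropy_right"]) >= min_entropy
                and float(row["solid_degree"]) >= min_solid):
            kept.append(row["word"])
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# Hypothetical thresholds, chosen only to illustrate the filter.
words = filter_words(rows, min_entropy=3.0, min_solid=50.0)
print(words)
```

With these thresholds "令狐" is dropped for its low right entropy and "山派" for its low left entropy, while "令狐冲" and "岳不群" pass all three checks.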
Processing a corpus of 1 million characters took 21 s, which is fairly fast.
Other references:
Java version:
https://github.com/sing1ee/dict_build
Python version: