new_words_discovery: A Python repository from colinwke

Statistics-based unsupervised Chinese word segmentation

Original article: http://www.matrix67.com/blog/archives/5044

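The method assumes two conditions that a string must meet to count as a word:

Cohesion (solid degree)
- For example, the cohesion of "键盘": $solid = \frac{freq(键盘)}{freq(键) + freq(盘)}$
Freedom
- The freedom of "键盘": the entropy of the characters adjacent to it on the left and the entropy of the characters adjacent to it on the right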
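The two measures can be sketched in a few lines of plain Python. This is only an illustration, not the repository's `_core` implementation: the helper names (`occurrences`, `cohesion`, `freedom`, `neighbor_entropy`) are hypothetical, entropy is assumed to be measured in bits, and cohesion is assumed to follow the probability-ratio form of the linked article, $P(w) / (P(a) \cdot P(b))$ minimized over all two-part splits of $w$. That form can exceed 1, as the solid_degree values reported further down do.

```python
import math
from collections import Counter


def occurrences(text, s):
    """Start indices of every (possibly overlapping) occurrence of s in text."""
    found, i = [], text.find(s)
    while i != -1:
        found.append(i)
        i = text.find(s, i + 1)
    return found


def neighbor_entropy(neighbors):
    """Shannon entropy in bits of a list of neighboring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def cohesion(text, word):
    """Minimum of P(word) / (P(left) * P(right)) over all two-part splits.

    Assumed form, taken from the linked article; word needs >= 2 characters.
    """
    n = len(text)

    def prob(s):
        return len(occurrences(text, s)) / n

    p_word = prob(word)
    return min(
        p_word / (prob(word[:i]) * prob(word[i:]))
        for i in range(1, len(word))
    )


def freedom(text, word):
    """Return (left entropy, right entropy) of the characters around word."""
    left, right = [], []
    for start in occurrences(text, word):
        end = start + len(word)
        if start > 0:
            left.append(text[start - 1])
        if end < len(text):
            right.append(text[end])
    return neighbor_entropy(left), neighbor_entropy(right)


# Toy corpus, made up for illustration only
text = "买键盘修键盘卖键盘用鼠标"
print(cohesion(text, "键盘"), freedom(text, "键盘"))
```

A candidate scores well when its parts rarely occur apart (high cohesion) and it appears in many different contexts (high entropy on both sides).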

Usage (example.py):

"""
Analyze the segmentation results
"""
import pandas as pd
from _core.freq_calculator import WordCut, read_file

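# Read the corpus
corpus = read_file(r'./corpus/Swordsman.txt')
# Strip redundant whitespace
corpus = [''.join(x.split()) for x in corpus]

# Segment
word_cut = WordCut(min_freq=4)  # minimum word frequency is 4; lower values run slower
result = word_cut.cut(corpus)  # compute per-word statistics

# Save the results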
result = pd.DataFrame(result, columns=word_cut.get_columns_name())
print(result.describe())
result.to_csv('result.csv', index=False, encoding='utf-8')

Segmentation statistics (the quantiles can guide the choice of filtering thresholds):

WordCut information:
min_freq: 4
word_max_len: 11
count bin_word total : 715713
count bin_word unique: 132309
corpus solid degree  : 0.815137
================  PdPrinter.print_full  ================
             length      frequent  entropy_left  entropy_right   solid_degree
count  32417.000000  32417.000000  32417.000000   32417.000000   32417.000000
mean       2.736589     19.728877      2.648585       2.635318     296.195382
std        0.921086     71.071457      1.186383       1.227081    3067.525635
min        2.000000      5.000000      0.103275       0.117595       1.000280
25%        2.000000      6.000000      1.921928       1.921928       3.500381
50%        2.000000      8.000000      2.521641       2.521641      11.374101
75%        3.000000     15.000000      3.188722       3.250000      50.721759
max        9.000000   6782.000000     11.576797      10.649964  143142.600000
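One possible way to turn the quantiles into thresholds is to cut each score at its lower quartile. The sketch below uses only the standard library; the score values are made up for illustration, and both the helper name `lower_quartile` and the choice of the 25% point are assumptions rather than a recommendation from the repository. The `inclusive` method interpolates linearly between data points, which should correspond to the default behavior of `describe()`.

```python
import statistics


def lower_quartile(values):
    """25th percentile, interpolating linearly between data points."""
    return statistics.quantiles(values, n=4, method="inclusive")[0]


# Hypothetical scores standing in for columns of result.csv
scores = {
    "entropy_left": [0.5, 1.9, 2.5, 3.2, 6.9],
    "entropy_right": [0.8, 1.9, 2.5, 3.3, 7.2],
    "solid_degree": [1.0, 3.5, 11.4, 50.7, 208.1],
}

# One threshold per column: candidates below it would be discarded
thresholds = {name: lower_quartile(vals) for name, vals in scores.items()}
print(thresholds)
```

A stricter filter would use the median or the upper quartile instead, at the cost of discarding more genuine words.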

Sample of extracted candidates (corpus: 笑傲江湖):

word,length,frequent,entropy_left,entropy_right,solid_degree
令狐,2,6782,11.576797175094372,0.8715216865031826,98.49217766897434
令狐冲,3,5954,11.478548750485759,7.275784432264426,93.99149100516193
什么,2,2109,5.662919848630054,8.3715346152597,209.5609075778746
山派,2,1640,2.060570055780612,6.1555117371002845,82.18053669255931
说道,2,1629,9.729354959628736,10.6499644658698,13.870566546245739
了一,2,1238,6.949119752779786,5.54763535954332,4.345274762356049
自己,2,1231,7.498273403721695,8.024101218827067,205.74226694103865
岳不群,3,1194,9.317047130639809,7.0761493108903295,208.09587600372572
一个,2,1186,7.201986921671323,7.556966059024818,15.445934945354473
弟子,2,1177,5.744631054003196,8.536645155111877,66.85190748279885
剑法,2,1139,5.544546538738371,8.29125603540392,74.29849169516365
也不,2,1126,7.254524392047882,5.978795970544172,10.598009026838083
师父,2,1108,7.20660380750861,8.427032495404294,107.9037923962667
不是,2,1103,5.682717786561449,8.182681365467001,3.6361191789122365
盈盈,2,1054,8.72553708261531,6.993663034228474,166.74181969612272
华山,2,1009,6.927067967420202,4.033540625085696,173.77673429756393
一声,2,996,3.5996330250417885,8.411177315225054,12.115111362128106

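The extracted candidates are then filtered column by column to decide which ones to keep. For entropy_left (left freedom), entropy_right (right freedom), and solid_degree (cohesion), higher is better.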

For example, the right freedom of "令狐" is very low (0.8715), so it can be filtered out.
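A filter of this kind might look like the following sketch, which reads a few rows copied from the sample above and keeps only the candidates that clear every threshold. The function name `filter_words` and the threshold values (2.0 for both entropies, 10.0 for cohesion) are hypothetical and chosen only to make the example concrete.

```python
import csv
import io

# A few rows copied from the extraction sample above
SAMPLE = """word,length,frequent,entropy_left,entropy_right,solid_degree
令狐,2,6782,11.576797175094372,0.8715216865031826,98.49217766897434
令狐冲,3,5954,11.478548750485759,7.275784432264426,93.99149100516193
了一,2,1238,6.949119752779786,5.54763535954332,4.345274762356049
岳不群,3,1194,9.317047130639809,7.0761493108903295,208.09587600372572
"""


def filter_words(rows, min_entropy=2.0, min_solid=10.0):
    """Keep words whose left entropy, right entropy, and cohesion all pass."""
    kept = []
    for row in rows:
        if (float(row["entropy_left"]) >= min_entropy
                and float(row["entropy_right"]) >= min_entropy
                and float(row["solid_degree"]) >= min_solid):
            kept.append(row["word"])
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(filter_words(rows))
```

With these thresholds "令狐" is dropped for its low right entropy and "了一" for its low cohesion, while "令狐冲" and "岳不群" are kept.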

Processing a corpus of 1 million characters took 21 s, which is reasonably fast.

Other references:

Java version:

https://github.com/sing1ee/dict_build

Python versions:

https://github.com/izisong/new-words-discovery

https://github.com/c19/ChineseSlicer
