zhangsonglei/Ngram

Training results from the Java code do not match the ngram-count results

Closed this issue · 2 comments

ngram-count training parameters:
1. Count n-gram frequencies
D:\cygwin64\home\chubb\srilm\bin\cygwin64\ngram-count.exe -text .\ngram_data -order 3 -write .\ngram_data_CN
2. Build the language model
D:\cygwin64\home\chubb\srilm\bin\cygwin64\ngram-count.exe -read .\ngram_data_CN -order 3 -lm .\ngram_data_LM -ukndiscount1
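
For readers unfamiliar with the counting step, here is a minimal Python sketch of what counting n-grams up to order 3 involves. It is not SRILM's implementation. The function name `count_ngrams` and the whitespace tokenization are assumptions made for illustration, and the `<s>`/`</s>` padding is meant to mirror the sentence-boundary tokens that ngram-count adds around each line.

```python
from collections import Counter


def count_ngrams(sentences, order=3):
    """Count every n-gram of length 1..order.

    Assumption: tokens are separated by whitespace, and each sentence is
    padded with <s> and </s> before counting.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            # Slide a window of width n across the padded sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus, not taken from the training data in this issue.
counts = count_ngrams(["a b a b", "a c"])
for ngram, count in sorted(counts.items()):
    print(" ".join(ngram), count)
```

If two tools disagree on the number of 1-grams, dumping counts like this from each side is one way to see which tokens exist in only one vocabulary.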

Java training parameters:
java -cp ChineseSpellingCheck.jar hust.tools.ngram.app.NGramLMTrain train.dat utf-8 3 Y kn trigram.lm text

Experimental results:
ngram-count output:
\data
ngram 1=5685
ngram 2=531648
ngram 3=1224327

\1-grams:
-2.663042 ! -1.557407
-2.955509 " -0.7174896
-3.874366 # -1.386763
-4.771382 $ -0.4257379
-3.893115 % -1.282743
-3.387168 & -1.997125
-3.668719 ' -0.6878299
-2.921485 ( -0.7731137
-2.906739 ) -1.016507
-3.555362 * -1.446827

Java code output:
3
5691
531679
1224330
-4.596007237193767 -0.30495468198354575
-2.6632636494389934 ! -0.6929173857907096
-2.9559188903417892 " -0.46791485367975977
-3.877592719528297 # -0.5462744250640476
-4.79732935876629 $ -0.34568711194645774
-3.8964839459321765 % -0.4328627851501955
-3.388233134376914 & -1.3182228221695054
-3.670735836633295 ' -0.42412574478336396
-2.9218657541928734 ( -0.4112865266073557
-2.90710789326123 ) -0.5585262636460637
-3.5569207307321604 * -0.5499648255080977
-3.1974033687466386 + -0.5313955482036398

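Problems:
1. The probabilities are essentially the same, but the backoff weights differ.
2. The n-gram counts differ slightly. Besides the added <unk>, the first entry (with a relatively high probability) is listed without a visible token, and the last four unigrams show up as "?". These are probably special characters that are not displayed, which would explain why there are 6 more 1-grams than in the ngram-count output.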
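
To put numbers on the two observations above, the unigram rows from both outputs can be lined up by token. The sketch below is illustrative only: `parse_unigrams` and `compare` are hypothetical helper names, and the sample rows are copied from the listings in this issue.

```python
def parse_unigrams(lines):
    """Parse 'logprob token backoff' rows into {token: (logprob, backoff)}."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip headers, counts, and rows whose token is missing
        logprob, token, backoff = parts
        table[token] = (float(logprob), float(backoff))
    return table


def compare(a, b):
    """Return {token: (|logprob diff|, |backoff diff|)} for shared tokens."""
    return {
        tok: (abs(a[tok][0] - b[tok][0]), abs(a[tok][1] - b[tok][1]))
        for tok in sorted(set(a) & set(b))
    }


srilm = parse_unigrams([
    "-2.663042 ! -1.557407",
    '-2.955509 " -0.7174896',
    "-4.771382 $ -0.4257379",
])
java = parse_unigrams([
    "-4.596007237193767 -0.30495468198354575",  # token not visible, skipped
    "-2.6632636494389934 ! -0.6929173857907096",
    '-2.9559188903417892 " -0.46791485367975977',
    "-4.79732935876629 $ -0.34568711194645774",
])

diffs = compare(srilm, java)
for tok, (dp, db) in diffs.items():
    print(f"{tok}: logprob diff {dp:.6f}, backoff diff {db:.6f}")

# Difference in reported 1-gram counts between the two models.
extra_unigrams = 5691 - 5685
print("extra 1-grams in Java output:", extra_unigrams)
```

On these sample rows the log probabilities agree to within a few hundredths, while the backoff weights differ by much larger amounts, which matches point 1.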

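Sorry for the late reply. The original implementation was fairly rough, and some details were indeed not handled carefully. Thank you for your experiments and suggestions; I will keep improving it when I have time.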

Thanks.