
New word discovery based on term frequency, aggregation coefficient, and left/right neighboring entropy


New Words Discovery


Before launching the script, install the following package in your Python 3 (<=3.6) environment:


pandas == 0.21.1

This implementation tries to discover four types of new words based on four parameters.

Four types of new words:

  1. Latin words, including:

    1. pure digits (2333, 12315, 12306)

    2. pure letters (iphone, vivo)

    3. a mixture of both (iphone7, mate9)

  2. 2-Chinese-character unigrams (a unigram is defined as an element produced by the segmenter)

  3. 3-Chinese-character unigrams

  4. bigrams, each composed of two unigrams
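The four types above can be told apart with a few simple checks. The following is a minimal sketch, not the project's actual code: the helper names are hypothetical, "Latin" is assumed to mean ASCII letters and digits, and Chinese characters are assumed to lie in the CJK Unified Ideographs block.

```python
import re

# Assumption: "Latin" tokens consist of ASCII letters and digits only.
LATIN_RE = re.compile(r"[A-Za-z0-9]+")
# Assumption: Chinese characters are those in the CJK Unified Ideographs block.
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]+")


def classify_latin(token):
    """Return 'digits', 'letters', 'mixed', or None for a non-Latin token."""
    if not LATIN_RE.fullmatch(token):
        return None
    if token.isdigit():
        return "digits"
    if token.isalpha():
        return "letters"
    return "mixed"


def classify_candidate(units):
    """Classify a candidate given as a tuple of unigrams (hypothetical helper)."""
    if len(units) == 2:
        return "bigram"
    if len(units) != 1:
        return None
    token = units[0]
    latin = classify_latin(token)
    if latin is not None:
        return "latin/" + latin
    if CHINESE_RE.fullmatch(token) and len(token) in (2, 3):
        return "unigram_%d" % len(token)
    return None


print(classify_candidate(("12315",)))    # latin/digits
print(classify_candidate(("iphone7",)))  # latin/mixed
```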


Four parameters:

  1. term frequency (tf): The number of occurrences of a word. A larger tf gives more confidence in the following three parameters.

  2. aggregation coefficient (agg_coef): A larger agg_coef indicates that the two words are more likely to co-occur, that is, their co-occurrence is less likely to be accidental.

where C(w_1, w_2) is the number of times w_1 is followed by w_2.

C(w_1) and C(w_2) are the counts of w_1 and w_2 respectively.
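The exact formula is not reproduced in this text, so the sketch below uses one common definition as an assumption: the ratio of the observed joint probability of w_1 w_2 to the product of the individual probabilities. The function name and the `total` argument (the number of tokens used to normalize the counts) are hypothetical and may differ from what the script computes.

```python
def aggregation_coefficient(pair_count, left_count, right_count, total):
    """Assumed definition: P(w_1 w_2) / (P(w_1) * P(w_2)).

    pair_count  -- C(w_1, w_2)
    left_count  -- C(w_1)
    right_count -- C(w_2)
    total       -- number of tokens used to normalize counts (assumption)
    """
    if left_count == 0 or right_count == 0:
        return 0.0
    # Algebraically equal to (pair/total) / ((left/total) * (right/total)).
    return pair_count * total / (left_count * right_count)


# A value near 1 means the pair co-occurs about as often as chance predicts;
# larger values mean the pair sticks together more than chance.
print(aggregation_coefficient(10, 20, 50, 1000))  # 10.0
```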

  3. minimum neighboring entropy

  4. maximum neighboring entropy

The minimum and maximum neighboring entropy are the minimum and maximum of left neighboring entropy and right neighboring entropy respectively.

A larger neighboring entropy of a word w indicates that w collocates with more possible words, which in turn indicates that w is an independent word. For instance, "我是" ("I am") has a large tf and a large agg_coef but a small minimum neighboring entropy, so it is not a word.

left entropy:

where w_l ranges over the set of unigrams that appear to the left of word w. The same formula also applies to the right neighboring entropy.
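As an illustration, neighboring entropy can be computed as the Shannon entropy of the distribution of neighbors observed on one side of w. This is a minimal sketch: the function names are hypothetical, and the natural logarithm is an assumption, since the log base is not stated here.

```python
import math
from collections import Counter


def neighboring_entropy(neighbor_counts):
    """Shannon entropy (natural log, an assumption) of a neighbor distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in neighbor_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy


def min_max_entropy(left_counts, right_counts):
    """Return (minimum, maximum) of the left and right neighboring entropy."""
    left = neighboring_entropy(left_counts)
    right = neighboring_entropy(right_counts)
    return min(left, right), max(left, right)


# Toy counts, not taken from any corpus: the left side always sees the same
# neighbor, so the minimum neighboring entropy is 0.
left = Counter({"x": 8})
right = Counter({"a": 2, "b": 2, "c": 2, "d": 2})
print(min_max_entropy(left, right))
```

A candidate whose minimum entropy is close to 0 almost always appears next to the same neighbor on one side, which suggests it is a fragment of a longer unit rather than an independent word.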


An example command (note that the double quotes cannot be omitted if a path you provide contains spaces):

python run_discover.py "G:\Documents\Exp Data\CCF_sogou_2016\sogouu8.txt" "G:\Documents\Exp Data\CCF_sogou_2016\reports" --latin 50 0 0 0 --bigram 20 80 0 1.5 --unigram_2 20 40 0 1 --unigram_3 20 41 0 1 --iteration 2 --verbose 2


For further information and help, run:

python run_discover.py --help
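Each of the options --latin, --bigram, --unigram_2 and --unigram_3 takes four numbers. The sketch below assumes, without confirmation from the source, that they are lower bounds for tf, agg_coef, minimum neighboring entropy and maximum neighboring entropy, in that order. The `Stats` container, the function names and the candidate data are all hypothetical; `python run_discover.py --help` remains the authoritative description.

```python
from collections import namedtuple

# Hypothetical container for the four parameters of one candidate word.
Stats = namedtuple("Stats", ["tf", "agg_coef", "min_entropy", "max_entropy"])


def passes(stats, thresholds):
    """True if every parameter reaches its threshold (>= is an assumption)."""
    return all(value >= limit for value, limit in zip(stats, thresholds))


def select_new_words(candidates, thresholds):
    """Keep the candidates whose statistics meet all four thresholds."""
    return sorted(word for word, stats in candidates.items()
                  if passes(stats, thresholds))


# Toy statistics, not taken from any corpus.
candidates = {
    "a b": Stats(25, 90.0, 0.4, 1.8),
    "c d": Stats(25, 90.0, 0.4, 1.0),   # maximum entropy too low
    "e f": Stats(5, 200.0, 2.0, 2.0),   # tf too low
}
print(select_new_words(candidates, (20, 80, 0, 1.5)))  # ['a b']
```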

Each iteration includes the following 11 steps:

  1. cutting
  2. counting characters
  3. counting unigrams
  4. counting bigrams
  5. counting trigrams
  6. calculating aggregation coefficients (for unigrams)
  7. counting neighboring words (for unigrams)
  8. calculating boundary entropy (for unigrams)
  9. calculating aggregation coefficients (for bigrams)
  10. counting neighboring words (for bigrams)
  11. calculating boundary entropy (for bigrams)
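The counting steps above can be sketched with `collections.Counter`. This is an illustration under stated assumptions rather than the script's implementation: the input is assumed to be already segmented into lists of unigrams, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def count_ngrams(sentences, n):
    """Count n-grams over already segmented sentences (lists of unigrams)."""
    counts = Counter()
    for units in sentences:
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return counts


def count_neighbors(sentences, n):
    """For every n-gram, count the unigrams directly to its left and right."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for units in sentences:
        for i in range(len(units) - n + 1):
            gram = tuple(units[i:i + n])
            if i > 0:
                left[gram][units[i - 1]] += 1
            if i + n < len(units):
                right[gram][units[i + n]] += 1
    return left, right


# Toy segmented input, not taken from any corpus.
sentences = [["a", "b", "c"], ["a", "b", "d"]]
bigram_counts = count_ngrams(sentences, 2)
left, right = count_neighbors(sentences, 2)
print(bigram_counts[("a", "b")])  # 2
print(right[("a", "b")])
```

The neighbor counters produced here are the kind of input from which a boundary (neighboring) entropy can then be calculated for each unigram or bigram.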

After each iteration, you will get four files reporting the new words of each type: Latin words, 2-Chinese-character words, 3-Chinese-character words, and bigrams. After the program exits, you will get four more files, each merging the new words of one type across all iterations.

If you encounter any problems, feel free to open an issue or contact me (rayarrow@qq.com).




