
An implementation of Chinese Word Correction.

Author: Hanwen, LIU - HKUST

A Chinese word correction system with detection and correction functions, based on an n-gram language model and Chinese text segmentation. The detection core focuses on continuous singletons, while the correction core focuses on the shape and pronunciation similarity of characters.


In [1]: import Checker


In [2]: fix = Checker.correct_core('我已经等猴多时了。')

In [3]: for word in fix:
   ...:     print(word)

['等候多时', '得道多助', '勇而多计']

In [4]: fix = Checker.correct_core('我已经等猴多时了。')

In [5]: answer = ''

In [6]: for word in fix:
   ...:     if isinstance(word, list):
   ...:         answer+=word[0]
   ...:     else:
   ...:         answer+=word

In [7]: print(answer)

Please make sure that the data files referenced in Checker.py are downloaded before running. Some files are too large for the repository; please download them from Google Drive: https://drive.google.com/open?id=1A_rifWNTVLkPeTfKTPN-KaeaqTj09-IG


  • Checker.py: Detection and correction system module.
  • CharSimilarity.py: Characters similarity measurement module.
  • Experimental Results.ipynb: Experimental Results.
  • sijiao_dict.py: Sijiao codes of characters from https://github.com/contr4l/SimilarCharactor.
  • similar_char_preprocessing.py: Precomputes and caches the similarity values between common characters.
  • testing data (folder): Testing data files.
  • chinese_word_correction_data.json: The original training data (provided by Prof. Lei CHEN).
  • weibo_contents_words.set: The vocabulary file of training data.
  • weibo_contents_words.bin: The trained binary KenLM language model.
  • pd_simi_dic.pkl: The similarity cache file; it can be generated by similar_char_preprocessing.py.

Future Work

Although the performance of our correction system is acceptable, several problems remain to be solved in the future.

First of all, the correction speed of our design is very low. We designed some optimizations to speed up the system; for example, we calculate the similarity values between common characters in advance. However, the number of character combinations is large and many comparisons are required, so the overall correction speed remains low. Because of this, we could only test our corrector on small test sets, so the results may not be fully convincing.
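
The precomputation idea above can be sketched roughly as follows. This is a minimal illustration, not the code in similar_char_preprocessing.py: `build_similarity_cache`, `lookup`, and `toy_similarity` are hypothetical names, and `toy_similarity` is only a stand-in for the real shape and pronunciation measure.

```python
from itertools import combinations


def build_similarity_cache(chars, similarity):
    """Precompute the similarity of every unordered pair of distinct characters."""
    cache = {}
    for a, b in combinations(sorted(set(chars)), 2):
        cache[(a, b)] = similarity(a, b)
    return cache


def lookup(cache, a, b):
    """Read a cached value; pairs are stored once, with the smaller character first."""
    if a == b:
        return 1.0
    key = (a, b) if a < b else (b, a)
    return cache[key]


def toy_similarity(a, b):
    # Hypothetical placeholder: NOT the shape/pronunciation similarity of the project.
    return 1.0 / (1 + abs(ord(a) - ord(b)))


cache = build_similarity_cache("猴候后", toy_similarity)
print(len(cache))  # n * (n - 1) / 2 pairs, i.e. 3 pairs for 3 characters
```

The cache grows quadratically with the number of characters, which is one reason the number of combinations becomes large for a realistic character set.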

The next problem is detection accuracy. We have noticed that if several singleton words appear that are not actually errors, the detector still treats them as errors, since this is how our detection algorithm works.
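
To make this failure mode concrete, here is a minimal sketch of singleton-based detection. It assumes that "continuous singletons" means runs of consecutive single-character tokens after segmentation; the function name `find_singleton_runs`, the `min_run` threshold, and the token lists are hypothetical and not taken from Checker.py.

```python
def find_singleton_runs(tokens, min_run=2):
    """Return (start, end) index ranges of consecutive one-character tokens."""
    runs = []
    start = None
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i  # a new run of singletons begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that reaches the end of the sentence.
    if start is not None and len(tokens) - start >= min_run:
        runs.append((start, len(tokens)))
    return runs


# Hypothetical segmentation of a sentence containing the typo '等猴'.
with_error = ["我", "已经", "等", "猴", "多时", "了"]
print(find_singleton_runs(with_error))  # [(2, 4)]

# Hypothetical segmentation of a correct sentence made of legitimate singletons.
no_error = ["我", "也", "是", "学生"]
print(find_singleton_runs(no_error))  # [(0, 3)]
```

The second call flags a correct sentence, because a rule of this kind cannot tell legitimate single-character words from fragments produced by a typo.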

The last problem is the candidate selection algorithm in the correction part. We select candidates by combining each candidate with the prefix words and querying the language model for a score. However, candidates made up of longer words (and therefore fewer words) tend to receive higher scores. For example, when correcting the sentence '平民刘备鉴持卟鞋', the prefix words are '平民' and '刘备', and the candidates are '坚持/补鞋' and '坚持不懈'. Although the first candidate is more similar to the original text, its score is likely to be lower than that of the second, because a 4-gram score is usually lower than a 3-gram score.
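
The length effect can be illustrated with a small sketch. The log-probabilities below are invented for illustration and are not output from the KenLM model. Per-token averaging is shown only as one common way to reduce such a bias; it is an assumption here, not something this project implements.

```python
def total_log_score(token_log_probs):
    """Sum of per-token log-probabilities, as in a whole-sequence score."""
    return sum(token_log_probs)


def per_token_log_score(token_log_probs):
    """Average log-probability per token (length-normalized)."""
    return sum(token_log_probs) / len(token_log_probs)


# Hypothetical log10 probabilities, one per token (NOT real model output).
two_word = [-2.0, -2.5, -1.5, -1.8]  # 平民 / 刘备 / 坚持 / 补鞋  (4 tokens)
one_word = [-2.0, -2.5, -2.6]        # 平民 / 刘备 / 坚持不懈     (3 tokens)

# Every extra token adds another negative term, so the 4-token sequence
# has the lower total even though its individual tokens score better.
print(total_log_score(two_word) < total_log_score(one_word))          # True
print(per_token_log_score(two_word) > per_token_log_score(one_word))  # True
```

With these made-up numbers the ranking flips once the score is divided by the token count, which shows how much of the difference comes from sequence length rather than from the candidates themselves.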


