Correction

Overview

A project to correct spelling errors in Chinese text.

Idea

Use language models to score the likelihood of a sentence or character sequence.
Combine bigram, trigram, and 4-gram scores and average them per character to smooth each character's score.
Compute outlier scores with the Median Absolute Deviation (MAD) and, together with the jieba word segmentation results, determine the range of characters to correct.
Generate candidate characters from confusion sets of characters with a similar shape or pronunciation, and correct the characters in that range one by one.
Errors in a sentence make the word segmentation more fragmented, so use the jieba segmentation after each replacement to decide which character to correct.
Errors in sentence-final modal particles are common; detect them and correct them directly.
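
The scoring and outlier steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `score_fn` callable (which stands in for whatever n-gram language model is used, for example a KenLM model), the `0.6745` scaling constant, and the default threshold are all assumptions made for the example. Only low scores are flagged, on the assumption that a lower language-model score means a less plausible character.

```python
from statistics import median


def ngram_char_scores(sentence, score_fn, orders=(2, 3, 4)):
    """Average, for each character, the scores of all n-gram windows covering it.

    `score_fn` is a hypothetical callable that maps a string (one n-gram)
    to a score such as a log-probability.
    """
    totals = [0.0] * len(sentence)
    counts = [0] * len(sentence)
    for n in orders:
        for start in range(len(sentence) - n + 1):
            score = score_fn(sentence[start:start + n])
            # Credit this window's score to every character it covers.
            for i in range(start, start + n):
                totals[i] += score
                counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def mad_outliers(scores, threshold=3.5):
    """Return indices whose score lies unusually far below the median.

    Uses the Median Absolute Deviation (MAD); 0.6745 rescales the ratio so
    it is comparable to a z-score. The threshold is an assumed default and
    would need tuning on real data.
    """
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        return []  # all deviations are (nearly) identical; nothing stands out
    return [
        i for i, s in enumerate(scores)
        if s < med and 0.6745 * abs(s - med) / mad > threshold
    ]


# Toy per-character scores with one clearly low value at index 4.
example_scores = [-2.0, -2.1, -1.9, -2.2, -9.0, -2.0]
suspicious = mad_outliers(example_scores)  # -> [4]
```

Because averaging over overlapping windows spreads a single error across its neighbours, short sentences may need a lower threshold than the default used here.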

TODO

Incorporate RNN language models (forward and backward) to estimate per-character probabilities and capture longer-range context
Build smaller, more realistic confusion sets (similar-shape and similar-pronunciation characters)
Collect and annotate more real-world Chinese sentences with grammatical errors or wrong characters

File structure

data/
* sighan/: SIGHAN contest data
* bcmi_data/: dataset of real-world sentences with grammatical errors
* wikipedia/: Chinese Wikipedia dataset (XML files, plain text) and extraction tools for preprocessing
* simp.pickle: dictionary of characters with similar pronunciation
* sims.pickle: dictionary of characters with similar shape
* simp_simplified.pickle: version of simp.pickle with uncommon characters (frequency below 100) filtered out
* xjz.pickle: concise dictionary of similar-shape characters
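
As a rough illustration of how confusion dictionaries like the ones listed above might be used, the sketch below builds candidate characters and tries each one at a suspicious position. The layout of the pickle files is not documented here, so the example assumes each dictionary maps a character to a collection of similar characters and uses small hand-written dictionaries instead of the real files. The function names and the toy `score_fn` are hypothetical.

```python
# Assumed layout: {character: iterable of similar characters}.
# The real files could be loaded with the standard pickle module, e.g.
#   with open("data/sims.pickle", "rb") as f:
#       shape_sets = pickle.load(f)
# but their actual structure may differ from this assumption.
shape_sets = {"己": {"已", "巳"}}   # toy similar-shape entries
sound_sets = {"在": {"再"}}         # toy similar-pronunciation entries


def candidates(char, shape_sets, sound_sets):
    """Union of similar-shape and similar-pronunciation characters for `char`."""
    pool = set(shape_sets.get(char, ())) | set(sound_sets.get(char, ()))
    pool.discard(char)  # never propose the character itself
    return sorted(pool)


def best_replacement(sentence, index, cands, score_fn):
    """Try each candidate at `index` and keep the highest-scoring sentence.

    `score_fn` is a hypothetical sentence scorer (higher is better); the
    original sentence is returned if no candidate improves the score.
    """
    best, best_score = sentence, score_fn(sentence)
    for cand in cands:
        trial = sentence[:index] + cand + sentence[index + 1:]
        trial_score = score_fn(trial)
        if trial_score > best_score:
            best, best_score = trial, trial_score
    return best


# Toy scorer that simply rewards the common word "已经".
def toy_score(sentence):
    return 1.0 if "已经" in sentence else 0.0


fixed = best_replacement("我己经到了", 1, candidates("己", shape_sets, sound_sets), toy_score)
```

In practice the scorer would be a language model rather than a keyword check, and the index would come from the outlier detection step.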

kenlm/: library to generate statistical language models
kenmodels/: trained language models; the *.klm files are binary

nlm/: various neural language models
* tf_char_rnn: Character-level RNN language model implemented using TensorFlow
* cn_char_rnn: RNN language model for Chinese
* lstm_char_cnn: A CNN language model, not useful for this project
* char_rnn: Character-level RNN language model gists

spells/: tools for English spell checking

langconv.py: tool to convert between Simplified and Traditional Chinese characters
zh_wiki.py: Simplified/Traditional conversion dictionary

tf_char_rnn/:
* checkpoints/: 17 is the backward model, 10 is the forward model
* logs/: summaries of training runs
* data/: text input to train models
* model.py: script describing the model
* train.py: script to train the model
* sample.py: script to sample texts and calculate per-char probabilities of a sequence
* utils.py: utilities for reading data and generating batches
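
Since the checkpoints include both a forward and a backward model, one possible way to combine their per-character probabilities is sketched below. This is only an assumption about how the two directions could be merged, not a description of what sample.py does: the function name is hypothetical, and the sketch assumes the backward model was run on the reversed sentence, so its output has to be flipped before alignment.

```python
import math


def combine_bidirectional(forward_probs, backward_probs_reversed):
    """Average forward and backward log-probabilities for each character.

    `forward_probs[i]` is the forward model's probability of character i.
    `backward_probs_reversed` is assumed to come from a model run on the
    reversed sentence, so it is flipped to line up with the forward order.
    """
    backward_probs = list(reversed(backward_probs_reversed))
    if len(forward_probs) != len(backward_probs):
        raise ValueError("forward and backward outputs must have equal length")
    return [
        (math.log(f) + math.log(b)) / 2.0
        for f, b in zip(forward_probs, backward_probs)
    ]


# Toy probabilities for a two-character sequence.
combined = combine_bidirectional([0.5, 0.25], [0.25, 0.5])
```

Averaging in log space is equivalent to taking the geometric mean of the two probabilities, which keeps a character's score low if either direction finds it implausible.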

results.out: output log

References

RNN language models: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sound-shape code: http://mabusyao.iteye.com/blog/2267661
Spellchecking by computer: http://www.dcs.bbk.ac.uk/~roger/spellchecking.html
GNU Aspell: http://aspell.net/
Sogou Labs data: http://www.sogou.com/labs/resource/list_pingce.php
Chinese error correction with a seq2seq model: http://media.people.com.cn/n1/2017/0112/c409703-29018801.html
char-rnn-tensorflow: https://github.com/fujimotomh/char-rnn-tensorflow
Training and using KenLM language models: http://www.cnblogs.com/zidiancao/p/6067147.html
KenLM: https://github.com/kpu/kenlm
Unsupervised word segmentation based on language models: http://spaces.ac.cn/archives/3956/
Chinese sentence structure tree database: http://turing.iis.sinica.edu.tw/treesearch/
CKIP datasets: http://rocling.iis.sinica.edu.tw/CKIP/engversion/index.htm
Automatic query correction techniques and architecture in the DataGrand search engine: http://www.datagrand.com/blog/search-query.html
