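A simple duplicate-checking (plagiarism detection) implementation for a Pattern Recognition course project. Given a set of Chinese texts, it reports how similar each one is to the [LCMC] corpus. It runs fairly slowly; using sparse matrices could speed it up.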
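To illustrate the sparse-matrix idea mentioned above, here is a minimal sketch, not the project's actual code. It assumes a bag-of-words representation and stores only non-zero term counts in a dictionary, so the similarity computation skips the zeros. The function names `sparse_vector` and `sparse_cosine` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def sparse_vector(tokens):
    """Map a token list to a sparse term-count vector (only non-zero entries are stored)."""
    return Counter(tokens)


def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors stored as term -> count mappings."""
    if not a or not b:
        return 0.0
    # Iterate over the smaller vector; terms missing from the other contribute zero.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical pre-tokenized sentences (in practice these might come from jieba.lcut).
query = ["模式", "识别", "大", "作业"]
corpus_sentence = ["模式", "识别", "课程"]
score = sparse_cosine(sparse_vector(query), sparse_vector(corpus_sentence))
```

The cost of each comparison then depends on the number of distinct terms in the shorter text rather than on the vocabulary size.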
python 3.6
numpy
gensim
jieba (if using a Chinese corpus)
nltk (if using an English corpus)
- download any pre-trained Chinese word vectors from here: [Embedding/Chinese-Word-Vectors: 100+ Chinese Word Vectors]
- download LCMC corpus from here: [The Lancaster Corpus of Mandarin Chinese]
- run the command below to see the available options:
python run_and_test.py -h
The language for screen output is Chinese.
--pre_train: --thre 0.7
--no-pre_train: --thre 0.9
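The two lines above pair each mode with a `--thre` value. As a rough sketch of how such a threshold might be applied, assuming it is a cutoff on a similarity score between 0 and 1 (how `run_and_test.py` actually uses it is not shown here), the hypothetical helper `flag_duplicates` below flags texts whose best-match score reaches the cutoff. The scores are made up for illustration.

```python
def flag_duplicates(scores, threshold):
    """Return (index, score) pairs for texts whose similarity reaches the threshold."""
    # Assumption: a text counts as a likely duplicate when score >= threshold.
    return [(i, s) for i, s in enumerate(scores) if s >= threshold]


# Hypothetical best-match similarity scores for three input texts.
scores = [0.95, 0.75, 0.40]

flagged_pre_train = flag_duplicates(scores, 0.7)     # cutoff listed for --pre_train
flagged_no_pre_train = flag_duplicates(scores, 0.9)  # cutoff listed for --no-pre_train
```

With these sample scores, the lower cutoff flags the first two texts and the higher cutoff flags only the first.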