PR_project_plag_detect

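A simple plagiarism-detection implementation for a Pattern Recognition course project. Given a set of Chinese texts, it reports how similar each one is to the [LCMC] corpus. It runs fairly slowly; using sparse matrices would be one way to speed it up.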
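As a rough illustration of the sparse idea mentioned above (not the code in this repository), each text can be stored as a sparse bag-of-words vector, that is, a mapping from token to count, so that cosine similarity only touches tokens that actually occur. The function names and the pre-segmented inputs are assumptions made for this sketch; in practice the tokens would come from a segmenter such as jieba.

```python
import math
from collections import Counter


def sparse_vector(tokens):
    """Build a sparse bag-of-words vector: only tokens that occur are stored."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as token -> count mappings."""
    if not a or not b:
        return 0.0
    # Iterate over the smaller vector so the dot product costs O(min(|a|, |b|)).
    if len(a) > len(b):
        a, b = b, a
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical pre-segmented sentences, for illustration only.
query = sparse_vector(["模式", "识别", "大", "作业"])
reference = sparse_vector(["模式", "识别", "课程"])
score = cosine_similarity(query, reference)
```

Storing only non-zero counts keeps memory and the dot product proportional to the number of distinct tokens in a text rather than to the vocabulary size, which is where a dense matrix would spend most of its time.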

Requirements

python 3.6

numpy

gensim

jieba (if using a Chinese corpus)

~~nltk (if using an English corpus)~~

Download any pre-trained Chinese word vectors from here: [Embedding/Chinese-Word-Vectors: 100+ Chinese Word Vectors]
Download the LCMC corpus from here: [The Lancaster Corpus of Mandarin Chinese]
Run the command below to see the available options:

python run_and_test.py -h

All screen output is in Chinese.

With --pre_train, use --thre 0.7.

With --no-pre_train, use --thre 0.9.
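How the threshold is applied inside run_and_test.py is not described here. One plausible reading is that a text is flagged when its highest similarity against the corpus reaches the --thre value. A minimal sketch under that assumption follows; the function name, the text names, and the scores are all hypothetical.

```python
def flag_suspects(best_scores, threshold):
    """Return the names of texts whose best corpus similarity reaches the threshold."""
    return sorted(name for name, score in best_scores.items() if score >= threshold)


# Made-up similarity scores, for illustration only.
best_scores = {"text_a": 0.93, "text_b": 0.75, "text_c": 0.40}

flagged_low = flag_suspects(best_scores, threshold=0.7)   # the 0.7 setting
flagged_high = flag_suspects(best_scores, threshold=0.9)  # the 0.9 setting
```

Under this reading, the higher 0.9 setting flags fewer texts from the same scores, so the two values are not interchangeable between the two modes.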