A plagiarism detector that examines a text file and performs binary classification, labeling the file as either plagiarized or not depending on how similar it is to a provided source text.
The notebooks require a Python 3 environment.
One way to detect plagiarism is to compute similarity features that measure how similar a given text file is to an original source text. Since our approach is based on this paper, we create two features: containment and longest common subsequence.
- Containment: first count the occurrences of words across the whole body of text (several text files), then compare a submitted text with its source text relative to those counts.
- Longest Common Subsequence (LCS): the longest subsequence common to all sequences in a set of sequences (often just two sequences).
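
The two features above can be sketched in plain Python. This is a minimal illustration, not the code from the notebooks: the function names `ngrams`, `containment`, and `lcs_norm_word` are hypothetical, as are the lowercase, whitespace-based tokenization and the choice to normalize both scores by the length of the submitted text.

```python
from collections import Counter


def ngrams(text, n):
    """Count the word n-grams in `text` (assumed: lowercase, whitespace split)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def containment(answer, source, n=1):
    """Fraction of the answer's n-grams that also occur in the source."""
    answer_counts = ngrams(answer, n)
    source_counts = ngrams(source, n)
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0
    # Counter & Counter keeps the minimum count of each shared n-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total


def lcs_norm_word(answer, source):
    """Word-level LCS length divided by the number of words in the answer."""
    a = answer.lower().split()
    s = source.lower().split()
    if not a:
        return 0.0
    # table[i][j] holds the LCS length of a[:i] and s[:j].
    table = [[0] * (len(s) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, start=1):
        for j, word_s in enumerate(s, start=1):
            if word_a == word_s:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(s)] / len(a)


answer = "this is an answer text"
source = "this is a source text"
print(containment(answer, source, n=1))  # 0.6
print(containment(answer, source, n=2))  # 0.25
print(lcs_norm_word(answer, source))     # 0.6
```

In this toy pair, three of the five answer words ("this", "is", "text") appear in the source, and the same three words form the longest common subsequence, so both scores are 0.6. Only one of the four answer bigrams is shared, which gives 0.25.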
This repository contains:
- 2_Plagiarism_Feature_Engineering.ipynb creates the features and prepares the datasets.
- 3_Training_a_Model.ipynb trains the deep learning model.
- The source_pytorch folder contains functions to train the model and predict results.
The trained model reaches 96% accuracy (24 of 25 predictions correct), with 1 false positive and 0 false negatives.
Actuals \ Predictions | 0 | 1 |
---|---|---|
0 | 9 | 1 |
1 | 0 | 15 |
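
The reported accuracy follows from the table by simple arithmetic. The sketch below assumes that label 1 means plagiarized (the positive class), which matches the stated single false positive. Precision and recall are not reported in the original and are derived here only from the four counts shown.

```python
# Confusion-matrix counts copied from the table
# (rows are actual labels, columns are predicted labels).
tn, fp = 9, 1    # actual 0: predicted 0, predicted 1
fn, tp = 0, 15   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # 24 / 25
precision = tp / (tp + fp)     # 15 / 16
recall = tp / (tp + fn)        # 15 / 15

print(f"accuracy={accuracy:.2f} precision={precision:.4f} recall={recall:.2f}")
# accuracy=0.96 precision=0.9375 recall=1.00
```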
The content of this repository is licensed under the MIT license.