A Naive Bayes classifier for detecting plagiarism, trained on a dataset of short answers developed by Clough and Stevenson.
To train the classifier, be sure to do the following first:
- Clone this repository.
- Download a modified version of the dataset.
- Place the dataset files in your cloned copy of the repository.
- Make sure you have installed all the Python packages defined in requirements.txt.
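
As a quick sanity check for the last step, the following sketch reports which packages named in a requirements file are not installed. The helper name `missing_requirements` is hypothetical and not part of this repository, and the parsing only handles simple lines such as `name==1.0`; it is not a full requirements parser.

```python
import re
from importlib import metadata


def missing_requirements(requirement_lines):
    """Return the names of listed packages that are not installed.

    Hypothetical helper: handles only simple 'name' or 'name==version'
    style lines, and skips comments, blank lines, and pip options.
    """
    missing = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # Keep only the distribution name (drop extras, pins, markers).
        name = re.split(r"[\s<>=!~;\[@]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

From the root of the cloned repository, `missing_requirements(open("requirements.txt"))` should return an empty list once everything is installed.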
The feature engineering steps are defined in the 2_Plagiarism_Feature_Engineering.ipynb Jupyter notebook.
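
The notebook is the reference for the actual features. Purely as an illustration of the kind of feature used in plagiarism detection, the sketch below computes n-gram containment: the fraction of an answer's n-grams that also occur in the source text. Whether the notebook uses this exact feature is an assumption, and the function names `ngrams` and `containment` are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return its word n-grams as tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def containment(answer, source, n=1):
    """Fraction of the answer's n-grams that also appear in the source.

    Illustrative only; assumes simple whitespace tokenization.
    """
    answer_counts = Counter(ngrams(answer, n))
    source_counts = Counter(ngrams(source, n))
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0  # answer too short to contain any n-gram
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total
```

A value near 1 means almost every n-gram in the answer is found in the source, and a value near 0 means very little overlap.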
Most of the code is contained in the copycat_detector module.
For training, the 3_Training_a_Model.ipynb notebook was run on an Amazon SageMaker instance.
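
To give a rough idea of what a Naive Bayes classifier does with numeric features, here is a minimal pure-Python sketch of the Gaussian variant. It is not the code in copycat_detector or the training notebook: the class name `TinyGaussianNB` is hypothetical, and the choice of the Gaussian variant is an assumption made for illustration.

```python
import math
from collections import defaultdict


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes over numeric feature vectors (illustrative)."""

    def fit(self, X, y):
        # Group training rows by class label.
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)

        self.stats = {}
        for label, rows in by_class.items():
            prior = len(rows) / len(X)
            means = [sum(col) / len(rows) for col in zip(*rows)]
            # A small constant keeps every variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                for col, m in zip(zip(*rows), means)
            ]
            self.stats[label] = (prior, means, variances)
        return self

    def _log_posterior(self, row, label):
        # Log prior plus the sum of per-feature Gaussian log-likelihoods;
        # treating features as independent is the "naive" assumption.
        prior, means, variances = self.stats[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    def predict(self, X):
        # Pick the class with the highest (unnormalized) log posterior.
        return [
            max(self.stats, key=lambda label: self._log_posterior(row, label))
            for row in X
        ]
```

Each row of `X` would hold the engineered features for one answer, and `y` the corresponding labels (for example, 1 for plagiarized and 0 for original).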