This project builds a plagiarism detector that examines a text file and performs binary classification, labeling the file as either plagiarized or not depending on how similar it is to a provided source text. Detecting plagiarism is an active area of research; the task is non-trivial, and the differences between paraphrased answers and original work are often not obvious.
This project will be broken down into three main notebooks:
Notebook 1: Data Exploration
- Load in the corpus of plagiarism text data.
- Explore the existing data features and the data distribution.
- This first notebook is not required in your final project submission.
Notebook 2: Feature Engineering
- Clean and pre-process the text data.
- Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
- Select "good" features, by analyzing the correlations between different features.
- Create train/test .csv files that hold the relevant features and class labels for train/test data points.
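As an illustration of what a similarity feature might look like, here is a minimal sketch of n-gram containment: the fraction of an answer's word n-grams that also appear in the source text. The README does not name the features used in the notebook, so the choice of containment, the function names `ngrams` and `containment`, and the sample sentences are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Return a Counter of word n-grams found in the lowercased text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def containment(answer_text, source_text, n=1):
    """Fraction of the answer's n-grams that also occur in the source."""
    answer_counts = ngrams(answer_text, n)
    source_counts = ngrams(source_text, n)
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0  # the answer has fewer than n words
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total


# Hypothetical texts, used only to exercise the function.
source = "the quick brown fox jumps over the lazy dog"
answer = "the quick brown cat jumps over the dog"
print(containment(answer, source, n=1))
print(containment(answer, source, n=2))
```

A value near 1.0 means most of the answer's n-grams are shared with the source. Computing the feature for several values of n gives a set of candidate features whose correlations can then be compared.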
Notebook 3: Train and Deploy Your Model in SageMaker
- Upload your train/test feature data to S3.
- Define a binary classification model and a training script.
- Train your model and deploy it using SageMaker.
- Evaluate your deployed classifier.
- NumPy - A fundamental package for scientific computing with Python.
- Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
- scikit-learn - Simple and efficient tools for data mining and data analysis.
- Matplotlib - A Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
- Pickle - The pickle module implements binary protocols for serializing and de-serializing a Python object structure.
- Seaborn - A Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- boto3 - The Amazon Web Services (AWS) SDK for Python. It enables Python developers to create, configure, and manage AWS services such as EC2 and S3, and provides an easy-to-use, object-oriented API as well as low-level access to AWS services.
- SageMaker - The SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker.
You will also need software installed to run a Jupyter Notebook.
If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included.
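To check quickly whether the packages listed above are available in the current environment, a small script along these lines may help. It is only a sketch: the helper name `missing_packages` is hypothetical, and the import names (for example `sklearn` for scikit-learn) are the usual ones but are not taken from this README.

```python
import importlib.util

# Import names for the libraries listed above (assumed, not from the README).
REQUIRED = [
    "numpy", "pandas", "sklearn", "matplotlib",
    "pickle", "seaborn", "boto3", "sagemaker",
]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages())
```

An empty list means every package can be imported; any name printed needs to be installed before running the notebooks.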
The code is provided in the 1_Data_Exploration.ipynb, 2_Plagiarism_Feature_Engineering.ipynb and 3_Training_a_Model.ipynb notebook files. The Linear Learner section must be executed in a notebook on the Amazon SageMaker platform: LinearLearner is a built-in SageMaker algorithm, so it can only be trained and deployed on Amazon SageMaker.
In a terminal or command window, navigate to the top-level project directory Project_Plagiarism_Detection/
(that contains this README) and run one of the following commands:
ipython notebook 1_Data_Exploration.ipynb
or
jupyter notebook 1_Data_Exploration.ipynb
This will open the Jupyter Notebook software and project file in your browser.
The datasets in this project are provided by Udacity and are limited to use in this project.
Once our model is deployed, we can see how it performs when applied to the test data. We assume the data is stored locally in data_dir and named test.csv; the labels and features are extracted from this .csv file.
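The extraction step could look roughly like the sketch below. It assumes a headerless test.csv in which the first column is the class label and the remaining columns are features; the README does not state the column layout, so treat that as an assumption. The helper name `load_labels_and_features` and the two sample rows are made up for the example, which writes a tiny file to a temporary directory so that it runs on its own.

```python
import csv
import os
import tempfile


def load_labels_and_features(csv_path):
    """Split each row into a label (first column) and features (the rest)."""
    labels, features = [], []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            labels.append(int(float(row[0])))
            features.append([float(value) for value in row[1:]])
    return labels, features


# Stand-in for data_dir: a temporary folder holding a two-row test.csv.
data_dir = tempfile.mkdtemp()
test_path = os.path.join(data_dir, "test.csv")
with open(test_path, "w", newline="") as f:
    csv.writer(f).writerows([[1, 0.9, 0.8], [0, 0.1, 0.05]])

test_y, test_x = load_labels_and_features(test_path)
print(test_y)
print(test_x)
```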
We use our deployed predictor to generate predicted class labels for the test data. Then we compare those to the true labels, test_y, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that our model classified correctly.
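The accuracy calculation described above can be sketched in a few lines. The label lists here are invented for illustration; in the notebook the predictions would come from the deployed predictor, and the name `test_y_preds` is an assumption.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels (0.0 to 1.0)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true and predicted labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(int(t == p) for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: 4 of the 5 predictions agree with the truth.
test_y = [1, 0, 1, 1, 0]
test_y_preds = [1, 0, 0, 1, 0]
print(accuracy(test_y, test_y_preds))
```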
In this project we chose two models: LinearLearner, a built-in SageMaker algorithm, and a PyTorch neural network classifier. We then compare them and select the better one.
To implement a custom classifier, we need to complete a train.py script. We have been given a folder, source_pytorch, which holds starting code for a PyTorch model. This directory contains a train.py training script.
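One part of a SageMaker training script is reading where to find the data and where to save the model. The sketch below shows only that argument-parsing step and is not the provided train.py. It relies on the environment variables SM_MODEL_DIR and SM_CHANNEL_TRAIN that SageMaker training containers set (the latter assumes the input channel is named "train"); the local fallback paths and the `--epochs` and `--lr` hyperparameters are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, falling back to SageMaker's environment."""
    parser = argparse.ArgumentParser()
    # Locations provided by SageMaker inside a training container;
    # the fallbacks are hypothetical paths for running locally.
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    # Hypothetical hyperparameters; the real script may define others.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser.parse_args(argv)


args = parse_args(["--epochs", "5"])
print(args.model_dir, args.data_dir, args.epochs, args.lr)
```

Passing an explicit argument list, as in the last lines, makes it possible to try the parser outside SageMaker; inside a training job the hyperparameters arrive as command-line arguments.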
- Amazon SageMaker - The machine learning platform used
- Amazon S3 - The storage service used
- Amazon API - The API used
- Keyvan Tajbakhsh - keyvantaj
See also the list of contributors who participated in this project.
This project is licensed under the MIT License; see the LICENSE.md file for details.