Project Plagiarism Detection

This project relate to building a plagiarism detector that examines a text file and performs binary classification; labeling that file as either plagiarized or not, depending on how similar that text file is to a provided source text. Detecting plagiarism is an active area of research; the task is non-trivial and the differences between paraphrased answers and original work are often not so obvious..

Getting Started

General Outline

This project will be broken down into three main notebooks:

Notebook 1: Data Exploration

Load in the corpus of plagiarism text data.
Explore the existing data features and the data distribution.
This first notebook is not required in your final project submission.

Notebook 2: Feature Engineering

Clean and pre-process the text data.
Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
Select "good" features, by analyzing the correlations between different features.
Create train/test .csv files that hold the relevant features and class labels for train/test data points.

Notebook 3: Train and Deploy Your Model in SageMaker

Upload your train/test feature data to S3.
Define a binary classification model and a training script.
Train your model and deploy it using SageMaker.
Evaluate your deployed classifier.

Prerequisites

NumPy - A fundamental package for scientific computing with Python.
Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
ScikitLearn - Simple and efficient tools for data mining and data analysis
Matplotlib - Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms
Pickle - The pickle module implements binary protocols for serializing and de-serializing a Python object structure.
Sea Born - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
boto3 - Boto is the Amazon Web Services (AWS) SDK for Python. It enables Python developers to create, configure, and manage AWS services, such as EC2 and S3. Boto provides an easy to use, object-oriented API, as well as low-level access to AWS services.
SageMaker - SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker.

You will also need to have software installed to run and execute a Jupyter Notebook

If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included.

Code

The project is divided into two parts. The code is provided in the 1_Data_Exploration.ipynb,2_Plagiarism_Feature_Engineering.ipynb and 3_Training_a_Model.ipynb notebook file. You will also be required to use aws SageMaker platform in the section Linear Learner to execute the code. This section is executed on Amazon SageMaker platform notebook. LinearLearner is a buitlin algorithm and we are only able to train and deploy this algorithm on Amazon SageMaker.

Run

In a terminal or command window, navigate to the top-level project directory Project_Plagiarism_Detection/ (that contains this README) and run one of the following commands:

ipython notebook 1_Data_Exploration.ipynb.ipynb

jupyter notebook 1_Data_Exploration.ipynb.ipynb

This will open the Jupyter Notebook software and project file in your browser.

Data

In this project datasets are provided by Udacity and limited to this project.

Running the tests

Once our model is deployed, we can see how it performs when applied to the test data. Assuming data is stored locally in data_dir and named test.csv. The labels and features are extracted from the .csv file.

We use our deployed predictor to generate predicted, class labels for the test data. Then we Compare those to the true labels, test_y, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that our model classified.

Break down into end to end tests

In this project we choosed two models LinearLearner which is a builtin sagemaker algorithm and PyTorch Neural Network Classifier. After that we compare and select the best one.

To implement a custom classifier, we'll need to complete a train.py script. we've been given folder source_pytorch which hold starting code for PyTorch model, respectively. This directory has a train.py training script.

Built With

Amazon SageMaker - The web framework used
Amazon S3 - The web storage used
Amazon API - The API used

Authors

Keyvan Tajbakhsh - keyvantaj

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

keyvantaj/Project_Plagiarism_Detection