
An end-to-end plagiarism classification model deployed in AWS SageMaker.


Plagiarism Project, Machine Learning Deployment

This repository contains code and associated files for deploying a plagiarism detector using AWS SageMaker.

Project Overview

In this project, I've built a plagiarism detector that examines a text file and performs binary classification, labeling that file as either plagiarized or not depending on how similar it is to a provided source text. Detecting plagiarism is an active area of research; the task is nontrivial, and the differences between paraphrased answers and original work are often subtle.

The following SageMaker ML instances are used:

  • Notebook: ml.t2.medium
  • Training: ml.c4.xlarge
  • Deployment: ml.t2.medium

The project is broken down into three main notebooks:

Notebook 1: Data Exploration

This notebook loads in the corpus of plagiarism text data and explores the existing data features and the data distribution.

Notebook 2: Feature Engineering

This notebook cleans and pre-processes the text data. It defines features for comparing the similarity of an answer text and a source text, and extracts those similarity features. The "good" features are selected by analyzing the correlations between different features. Finally, train/test .csv files are created that hold the relevant features and class labels for the train/test data points.
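As an illustration of what a similarity feature can look like, here is a minimal sketch of n-gram containment: the fraction of an answer's n-grams that also appear in the source text. This README does not name the exact features the notebook uses, so the function names `ngrams` and `containment` and the simple whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def containment(answer_text, source_text, n=1):
    """Fraction of the answer's n-grams that also occur in the source.

    Hypothetical helper for illustration; counts are clipped so a repeated
    n-gram in the answer only matches as often as it occurs in the source.
    """
    answer_counts = Counter(ngrams(answer_text, n))
    source_counts = Counter(ngrams(source_text, n))
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0  # answer is shorter than n words
    overlap = sum(min(count, source_counts[gram])
                  for gram, count in answer_counts.items())
    return overlap / total
```

A value near 1 suggests the answer reuses most of its phrasing from the source, while a value near 0 suggests little shared wording. Computing this for several values of n gives a family of features whose correlations can then be compared.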

Notebook 3: Train and Deploy Your Model in SageMaker

In this notebook, I've uploaded the train/test feature data to S3 and defined a binary classification model along with a training script. The model was then trained, deployed, and tested using SageMaker.
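For orientation, a SageMaker training script typically starts by working out where the training data lives and where the model should be saved. The sketch below shows that part only, using the standard library. It assumes the training channel is named `train` (so SageMaker exposes it as `SM_CHANNEL_TRAIN`) and that the .csv files have no header and hold the class label in the first column; neither detail is confirmed by this README, and `parse_args` and `load_training_rows` are hypothetical names.

```python
import argparse
import csv
import os


def parse_args(argv=None):
    """Parse the directories a SageMaker training job passes to the script."""
    parser = argparse.ArgumentParser()
    # SageMaker sets these environment variables inside the training
    # container; the fallback values let the script run locally too.
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    return parser.parse_args(argv)


def load_training_rows(path):
    """Read a header-less CSV, assuming the label is the first column."""
    features, labels = [], []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            labels.append(int(float(row[0])))
            features.append([float(value) for value in row[1:]])
    return features, labels
```

The actual model fitting and saving would follow these two steps, and depends on the classifier chosen in the notebook.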


Setup Instructions

The notebooks provided in this repository are intended to be run on Amazon's SageMaker platform. The following is a brief set of instructions for setting up a managed notebook instance using SageMaker, from which the notebooks can be run.

Log in to the AWS console and create a notebook instance

Log in to the AWS console and go to the SageMaker dashboard. Click on 'Create notebook instance'.

  • The notebook name can be anything, and using ml.t2.medium is a good idea as it is covered under the free tier.
  • For the role, creating a new role works fine. Using the default options is also okay.
  • The notebook instance needs access to S3 resources, which it has by default. In particular, any S3 bucket or object with "sagemaker" in the name is available to the notebook.
  • Use the option to git clone the project repository into the notebook instance by pasting https://github.com/priyathamhub/plagiarism-detection.git

Open and run the notebook of your choice

Now that the repository has been cloned into the notebook instance, you can navigate to any of the notebooks and run or work with them. Additional instructions are contained in the respective notebooks.


This project was submitted as part of the Machine Learning Engineering Nanodegree Program at Udacity.