#Zhan-Zhian-Azadi

Plagiarism Detector

This simple project helps you to detect basic plagiarism in any two text files.

How to run the app

You can compare two files by running this command in a terminal:

make FILE1=<path to first file_1> FILE2=<path to second file_2> run

Algorithm

This is a simple approach based on how often each individual word occurs in the two files being compared.

The description is as follows:

- The first step is to read each file.

- Then the resulting string is pre-processed: the tab and newline characters are replaced, and all punctuation is 
  removed from the whole file string.

- After these two steps, we have two lists of individual words, one for each file.

- To remove duplicate words, we convert each list to a set and build a union set containing all the distinct words
  in both files.

- The size of this set is m+n-k, where m is the number of distinct words in the first file, n is the number of 
  distinct words in the second file, and k is the number of words common to both files.

- The main loop iterates over the words in this set, so it performs O(m+n-k) iterations.

- We also keep a list of the most common English words, declared as a constant at the beginning of the code.

- In the main loop, we count the occurrences of each word in set_union_keys: each occurrence adds 1 to the word's
  value, or 0.1 if the word is one of the common words.

- At the end, we have an embedding dictionary for each file: a vector of length m+n-k in which each element is the 
  weight of a word.

- Then we normalize the values in each dictionary to make the comparison more reasonable. This is done by dividing 
  each value by the magnitude of the vector formed by all the word weights in that file's dictionary.

- To compare the two embeddings, we use the ratio |v1 - v2| / |v1 + v2|, which ranges from 0 (maximum similarity) 
  to 1 (maximum difference).

- For the decision, we compare this ratio with a default threshold of 0.5, which can be changed as needed.

- The output is '1' if the ratio is less than the threshold, or '0' if it is greater than the threshold.
  If any exception occurs, the output is '-1'.

- This code runs in less than one millisecond for almost all the provided sample test files.

- The accuracy in detecting plagiarism is 100% for the provided sample test files.

- The code is written so that it can easily be modified to follow OOP guidelines for better extensibility and maintenance.
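
The steps listed under Algorithm can be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the function names, the short COMMON_WORDS tuple, and the choice to take strings instead of file paths are all assumptions made for the example. Case handling is not described above, so the sketch leaves words as they are, and tabs and newlines are replaced with spaces, which is also an assumption.

```python
import math
import string

# Hypothetical stand-in for the project's list of common English words.
COMMON_WORDS = ("the", "a", "an", "and", "of", "to", "in", "is")
COMMON_WEIGHT = 0.1       # weight added per occurrence of a common word
DEFAULT_THRESHOLD = 0.5   # default decision threshold from the description


def tokenize(text):
    """Replace tabs and newlines with spaces, strip punctuation, split into words."""
    text = text.replace("\t", " ").replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def embed(words, vocabulary):
    """Build a normalized weight dictionary over the union vocabulary."""
    weights = dict.fromkeys(vocabulary, 0.0)
    for word in words:
        weights[word] += COMMON_WEIGHT if word in COMMON_WORDS else 1.0
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # empty input: leave the all-zero vector as it is
    return {word: w / norm for word, w in weights.items()}


def difference_ratio(v1, v2):
    """Compute |v1 - v2| / |v1 + v2| for two dictionaries with the same keys."""
    diff = math.sqrt(sum((v1[k] - v2[k]) ** 2 for k in v1))
    total = math.sqrt(sum((v1[k] + v2[k]) ** 2 for k in v1))
    return diff / total


def detect(text1, text2, threshold=DEFAULT_THRESHOLD):
    """Return 1 if the texts look plagiarized, 0 if not, -1 on any error."""
    try:
        words1, words2 = tokenize(text1), tokenize(text2)
        vocabulary = set(words1) | set(words2)  # union set of size m+n-k
        v1, v2 = embed(words1, vocabulary), embed(words2, vocabulary)
        return 1 if difference_ratio(v1, v2) < threshold else 0
    except Exception:
        return -1
```

With this sketch, two identical texts give a ratio of 0 and two texts with no words in common give a ratio of 1. Two empty texts make the denominator zero, which raises an exception and yields -1.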