
OCaMOSS User's Guide

Plagiarism detection software, inspired by MOSS and implemented in OCaml. Runs using a command-line interface.

This is NOT an OCaml client for MOSS; it is a completely separate program. For more details about the system, or about how MOSS works in general, read the PDF report in this repository or the blog post I wrote.

Note: this was originally written as the final project for a course - it has since been updated by me, so some aspects of the PDF report may not be accurate. In particular, the latest version of OCaMOSS no longer uses a 2-3 tree.


Build instructions:

  • To build - make
  • To build & run the REPL - make run
  • To build & run unit tests - make test

Required Dependencies:

  • Yojson
  • ANSITerminal
  • OUnit (for unit tests)
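
A minimal setup sketch, assuming the dependencies are available through opam under these package names (the exact names are an assumption; check your opam switch):

    opam install yojson ANSITerminal ounit
    make run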

Commands:

(note - commands are case-sensitive)

  • run [threshold] - runs OCaMOSS on the working directory. The optional threshold argument sets the fraction of a file that must match another file for it to be flagged as plagiarised; it must be at least 0.4 and at most 1
  • dir - lists the working directory and the files that it contains
  • setdir [dir] - sets the relative directory to look for files in and resets any existing results
  • results - lists the file names for which there are results
  • results [filename] - lists the detailed overlap results for that file (make sure to include the file's extension)
  • resultpairs - lists all the pairs of files for which there are positive results
  • compare [fileA] [fileB] - prints out the specific overlaps between fileA and fileB (make sure to include the files' extensions)
  • quit - exits the REPL
  • help - displays the available commands

Usage instructions/tutorial:

  1. Use setdir to point to the folder you want to test. Requirements: file names must contain no spaces, and all files must have the same extension (example: setdir tests/test1)

  2. Use run with the desired threshold (example: run 0.5, which is equivalent to plain run, since the default threshold is 0.5)

  3. Use results to view the list of results, results [filename] to view the results for a specific file, and compare [fileA] [fileB] to compare the matching patterns of two files (example: results Camel.txt)

    Example for running test case 1 and inspecting its results:

    1. setdir tests/test1
    2. run
    3. results, results intset.ml, compare intset1.ml intset.ml, or resultpairs

Other information:

Similarity score:

  • a measure of how likely it is that file A plagiarized from file B
  • computed as the number of fingerprint hashes that A shares with B, divided by the total number of hashes in A's fingerprint
  • the overall similarity score for A is the average of all of A's pairwise similarity scores that are > 0.5
  • the threshold score for detecting possible plagiarism varies with the file type, but experimentally we determined it to be around 0.5
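
As a concrete illustration, here is a small OCaml sketch of the calculation described above (the function names and the list-of-hashes fingerprint representation are illustrative assumptions, not OCaMOSS's actual internals):

    (* Sketch only: fingerprints modeled as plain lists of hashes. *)
    module IntSet = Set.Make (Int)

    (* Score of A against B: the number of A's fingerprint hashes that
       also appear in B's, divided by the size of A's fingerprint.
       Note the score is asymmetric: similarity a b <> similarity b a. *)
    let similarity (fp_a : int list) (fp_b : int list) : float =
      let a = IntSet.of_list fp_a and b = IntSet.of_list fp_b in
      let matching = IntSet.cardinal (IntSet.inter a b) in
      float_of_int matching /. float_of_int (IntSet.cardinal a)

    (* Overall score for A: the average of A's pairwise scores > 0.5. *)
    let overall (scores : float list) : float =
      match List.filter (fun s -> s > 0.5) scores with
      | [] -> 0.
      | hits ->
        List.fold_left ( +. ) 0. hits /. float_of_int (List.length hits)

For example, if A's fingerprint contains 40 hashes and 22 of them also appear in B's fingerprint, A's score against B is 22 / 40 = 0.55, just above the 0.5 threshold.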

Supported languages/file formats:

  • OCaml - .ml
  • Java - .java
  • C - .c
  • Python - .py
  • English - .txt (note: English comparison does NOT account for semantics)

Self-generated test case descriptions (test case N is in directory tests/testN):

NOTE: to replicate results, run using threshold = 0.4

  1. exact duplicates - should return positive result
  2. variable names changed - should return positive result
  3. functions/comments reordered - should return positive result
  4. functions1.ml is a copy of functions.ml but with large sections deleted - should return positive result for functions1 but not functions
  5. different implementations of the same algorithm - should NOT return positive result
  6. completely different files - should NOT return positive result
  7. functions/comments reordered - should return positive result
  8. more than 2 files - the files are changed respectively as follows: function/variable names changed; random spaces/new lines added; rec declarations and match statement lines changed - should return positive result for all files except lab034.ml, which is a dummy
  9. more than 2 files - files changed respectively as follows: same comments but different code; comments deleted and same code with variable/function names changed - should NOT return positive result for first and should return positive result for second
  10. large group of all different files - should NOT return positive result
  11. txt files check - the files are changed respectively as follows: an exact Wikipedia article; an edited but very similar version of the article; the original with sentences shifted around; an exact copy; a file that says “camel” five times; a hazier edit of the original - should return positive result for all except the last two, “Camels.txt” and “CamelMaybeCopy.txt”
  12. Java check - a test with Java files, where one file has all comments removed
  13. C check - a test with C files, where one file has all comments removed
  14. Python check - a test with Python files, where one file has comments removed and variable names changed