code-authorship

A code authorship analysis project for the LING 227 course.

Where does the data go?

Data goes in the data/ folder, with a separate subfolder for each author. Make sure each author also has a corresponding out/ folder.
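The layout described above can be set up with a small helper. This is only a sketch: make_out_dirs is a hypothetical function, not part of the repository, and it assumes that out/ mirrors data/ with one subfolder per author.

```shell
# Hypothetical helper (not part of the repo): create out/<author> for every
# data/<author> folder. Assumes out/ mirrors the per-author layout of data/.
make_out_dirs() {
  root="${1:-.}"
  for author_dir in "$root"/data/*/; do
    [ -d "$author_dir" ] || continue   # nothing to do if data/ has no subfolders
    author=$(basename "$author_dir")
    mkdir -p "$root/out/$author"
  done
}
```

Calling make_out_dirs with no argument works on the current directory, so it can be run from the project root after the data is in place.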

Running the parser

Before you run the authorship analysis, you need to parse the data into JSON files. Just run the script ./json_parse and you'll be good to go. (You do need Maven in order to run the parser.)
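Since the parser depends on Maven, it may help to check for it before running the script. The sketch below assumes Maven is installed as the usual mvn command; require_cmd is a hypothetical helper, not something the project provides.

```shell
# Hypothetical pre-flight check (not part of the repo): succeed only if the
# named command can be found on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "error: '$1' not found on PATH" >&2
    return 1
  }
}
```

With that defined, require_cmd mvn && ./json_parse runs the parser only when Maven is available.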

Running the analysis

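Usage: ./analyze.sh [MAXSIM|SVM|TSNE|BASE] [style|struc|all] <max_num_files> <number_cross_validations> <authors...>

[MAXSIM|SVM|TSNE|BASE] denotes a mode. MAXSIM, SVM, and BASE run their respective models; TSNE prints out a graph of the flattened vectors.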
[style|struc|all] denotes what feature set you want to use.
<max_num_files> denotes the number of files to use for each author. We suggest setting this to the number of files in your smallest category.
<number_cross_validations> denotes the number of cross validations to use when evaluating the model.
<authors...> is a list of the authors/groups you'd like to analyze, according to the name of their data and out folders.

Example: ./analyze.sh SVM all 800 10 jigsaw jetty (the optimal case for us!)
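The suggested value for <max_num_files>, the size of the smallest category, can be computed rather than counted by hand. The sketch below is one way to do it; min_file_count is a hypothetical helper, and it assumes every regular file under data/<author> (including nested ones) counts toward that author's total.

```shell
# Hypothetical helper (not part of the repo): print the smallest per-author
# file count. Usage: min_file_count <project_root> <author>...
min_file_count() {
  root="$1"; shift
  min=""
  for author in "$@"; do
    # Count regular files, including any in nested folders (an assumption).
    n=$(find "$root/data/$author" -type f | wc -l)
    n=$((n))   # normalize: some wc implementations pad the number with spaces
    if [ -z "$min" ] || [ "$n" -lt "$min" ]; then
      min=$n
    fi
  done
  echo "$min"
}
```

For example, ./analyze.sh SVM all "$(min_file_count . jigsaw jetty)" 10 jigsaw jetty would use the smaller of the two authors' file counts.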
