Predicting Criminal Sentences
jc882, yl477, jp624
Where's the raw data that we used in our report?
In raw_data/
Requirements
- Python 2.7.x (incompatible with Python 3.x)
- pip (a package management system for Python)
- numpy and scipy (pip install numpy, pip install scipy)
- Orange (pip install orange)
- scikit-learn (pip install scikit-learn)
- R and rpy2 (pip install rpy2)
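A quick way to confirm that the packages import cleanly is a small check like the one below. This is a sketch and not part of the repository: check_imports is a hypothetical helper, and the import names (Orange for the Orange package, sklearn for scikit-learn) are assumptions about how each package is imported.

```python
import importlib


def check_imports(module_names):
    """Return a dict mapping each module name to True if it imports, else False."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # Any failure during import counts as "not working".
            status[name] = False
    return status


# Assumed import names for the packages listed above.
REQUIRED = ["numpy", "scipy", "Orange", "sklearn", "rpy2"]

for name, ok in sorted(check_imports(REQUIRED).items()):
    print("%-8s %s" % (name, "ok" if ok else "MISSING"))
```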
File Structure
- bin/ - contains SVM binaries
- data/ - contains (sample) data for learning
- data_reg/ - contains output for linear/svm regression
- data_stat/ - contains output for doing the binomial sign test
- data_svm/ - contains SVMLight-compatible data files
- data/voting.tab - sample data (binary decision variable)
- data/voting_3class.tab - sample data (3-class decision variable)
- src/ - code to run
- src/includes - helper methods and classes
How to use
- You should only need src/main.py and src/main_reg.py to see how to perform learning on the data (run them to see!)
- Add svm_light binaries to the bin/ folder (create it if it doesn't exist)
- Install all the packages listed in the requirements and ensure that they are working
- Many functions assume the data in the first column is the class/label - so make sure that's the case!
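If the class/label is not already in the first column, the rows can be reordered before learning. The snippet below is a minimal sketch of that idea: move_column_first is a hypothetical helper, not something from src/includes, and the sample rows and column names are made up for illustration.

```python
def move_column_first(rows, label_index):
    """Return new rows with the column at label_index moved to position 0."""
    reordered = []
    for row in rows:
        label = row[label_index]
        rest = row[:label_index] + row[label_index + 1:]
        reordered.append([label] + rest)
    return reordered


# Made-up rows: the label ("sentence") starts out in the last column.
rows = [
    ["age", "priors", "sentence"],
    ["34", "2", "long"],
    ["21", "0", "short"],
]
for row in move_column_first(rows, 2):
    print("\t".join(row))
```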
Creating an Example File
- parse.py can be run and configured from the command line
- Type python parse.py --help to see all the possible options available to generate data
Manipulating the Example File
- Get output from parse.py
- Use Converters.split_orangetab_into_2 to split the dataset into 2 - training and validation.
- Use Converters.orangetab_to_svmlight to convert an orange file into an SVMLight-compatible format.
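To show roughly what these two steps produce, here is a standalone sketch. It does not reproduce the Converters implementation: split_rows and to_svmlight_line are hypothetical names, the 80/20 shuffled split is an assumption, and the feature values are made up. The output format follows the SVMLight convention of one example per line, written as a label followed by index:value pairs with 1-based, ascending feature indices.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def to_svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...'.

    Indices are 1-based and ascending. Zero-valued features are left out,
    which the sparse format allows.
    """
    pairs = ["%d:%g" % (i, v) for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# Made-up examples: (label, feature vector), label in {+1, -1}.
examples = [(1, [0.5, 0.0, 2.0]), (-1, [0.0, 1.0, 0.0]), (1, [1.0, 1.0, 0.0]),
            (-1, [0.0, 0.0, 3.0]), (1, [2.0, 0.0, 0.0])]
training, validation = split_rows(examples)
for label, features in training:
    print(to_svmlight_line(label, features))
```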
Tuning
- Set TRAIN_FILE to the training dataset and VAL_FILE to the validation dataset, then note the optimal parameters and the classification accuracy obtained with them.
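The general idea behind tuning on a validation set can be sketched as a grid search. This is an illustration only, not the tuning code in src/: grid_search and the evaluate callback are hypothetical, and the toy scoring function stands in for "train on TRAIN_FILE, return accuracy on VAL_FILE".

```python
import itertools


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_accuracy).

    param_grid maps a parameter name to the list of values to try.
    evaluate(params) should train on the training set and return the
    accuracy measured on the validation set.
    """
    names = sorted(param_grid)
    best_params, best_accuracy = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        accuracy = evaluate(params)
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy


# Toy stand-in for a real train/validate step; it peaks at C = 1.0.
def toy_evaluate(params):
    return 1.0 - abs(params["C"] - 1.0)


best, accuracy = grid_search({"C": [0.1, 1.0, 10.0]}, toy_evaluate)
print(best, accuracy)
```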
Output
- Once you know the optimal parameters, set the attrs variable and run oc.cross_validate().
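For readers unfamiliar with cross-validation, the fold bookkeeping looks roughly like the sketch below. It makes no claim about how oc.cross_validate() is implemented: k_fold_indices is a hypothetical helper, and it splits the examples in order without shuffling.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train on %d examples, test on %s" % (len(train), test))
```

Each example lands in exactly one test fold, so averaging the per-fold accuracies uses every example for testing once.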
Sign Test
- Run print_linear_svm, print_decision_tree, and so on to print out each algorithm's predicted labels to /data_svm.
- You also need to run Converters.write_actual_labels to print the true labels.
- Validators.binomial_sign_test can then be used on the true labels, and predicted labels from 2 algorithms, to see if one of them performs better than the other.
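As a reference for what a binomial sign test computes, here is a self-contained sketch. It is not the code in Validators: sign_test_sketch is a hypothetical name, and the two-sided p-value is an assumption about which variant is wanted. Only examples where exactly one algorithm is correct count; under the null hypothesis each such example is equally likely to favor either algorithm.

```python
from math import factorial


def sign_test_sketch(actual, predicted_a, predicted_b):
    """Two-sided binomial sign test comparing two classifiers.

    Returns (wins_a, wins_b, p_value), where wins_x counts the examples
    that only classifier x labeled correctly.
    """
    wins_a = wins_b = 0
    for truth, a, b in zip(actual, predicted_a, predicted_b):
        if a == truth and b != truth:
            wins_a += 1
        elif b == truth and a != truth:
            wins_b += 1
    n = wins_a + wins_b
    if n == 0:
        return wins_a, wins_b, 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(factorial(n) // (factorial(i) * factorial(n - i))
               for i in range(k + 1)) / float(2 ** n)
    return wins_a, wins_b, min(1.0, 2.0 * tail)


# Made-up labels: A is right on all 10 examples, B only on the first 2.
actual = [1] * 10
pred_a = [1] * 10
pred_b = [1] * 2 + [0] * 8
print(sign_test_sketch(actual, pred_a, pred_b))
```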
Precision/Recall
- The procedure is the same as for the Sign Test, except that you call Validators.precision_recall instead of Validators.binomial_sign_test.
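For reference, precision and recall for a binary label can be computed as in the sketch below. It is an illustration, not the Validators implementation: precision_recall_sketch and its positive argument are hypothetical, and the labels are made up.

```python
def precision_recall_sketch(actual, predicted, positive=1):
    """Return (precision, recall) for the class treated as positive."""
    tp = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1  # true positive
        elif guess == positive:
            fp += 1  # false positive
        elif truth == positive:
            fn += 1  # false negative
    # Report 0.0 when a ratio is undefined (no positives predicted/present).
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall


actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1]
print(precision_recall_sketch(actual, predicted))
```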