This is the code repo of my Master's thesis. I tried to give it a clear structure, but some of it must surely be considered "academic code"; sorry for that. Here's the structure in brief:
- /classifiers: All classifier/feature-vector combinations discussed in the thesis reside here. All of them extend a base class from util/classifiers.py
- /data: Stories must be stored here. I could not commit the stories I worked on to this repo, so an interested reader must populate this folder themselves
- /playground: In this folder I tried out new methods, classifiers, etc.
- /util: Two things are here: 1) some utility functions and classes, e.g. for creating cross validation datasets and distributing jobs across a cluster 2) some executables (see below) performing standalone utility functions like creating tables from JSON results
To get started:
- Run ./install_dependencies.sh, which should set up a virtualenv with all necessary dependencies
- Set up a directory structure in ./data, replacing the current one. For licensing reasons (I guess), the stories could not be committed to this repo; instead there are symlinks
- Store the path to the word embeddings in tester.ini. As they amount to 50 GB, they are not available online (cannot afford the possible traffic :( ) but can be requested. Alternatively, train them yourself using util/trainmodel.py
- Play around with ./tester.py
- Use ./nestedCrossVal.py and ./crossVal.py to replicate results
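The layout of tester.ini is not documented here, so the following is only a sketch of how a mapping from a chosen name to an embedding path could be stored and read with Python's configparser. The section name `embeddings`, the key `my_glove` and the path are hypothetical placeholders and not necessarily the format tester.py expects.

```python
import configparser

# Hypothetical tester.ini content: one section mapping a chosen name to a path.
EXAMPLE_INI = """
[embeddings]
my_glove = /path/to/embeddings/glove.bin
"""


def embedding_path(ini_text, name, section="embeddings"):
    """Return the path registered under `name`, or raise KeyError if it is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_option(section, name):
        raise KeyError(f"no embedding named {name!r} in section [{section}]")
    return parser.get(section, name)


print(embedding_path(EXAMPLE_INI, "my_glove"))
```

Check the tester.ini shipped with the repo for the real section and key names before relying on this layout.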
For ./tester.py and ./(nested)crossVal.py the following applies:
- For each dataset, at least the function isAroused in util/init.py must be implemented. Datasets will then be discovered automatically.
- Classifiers are detected automatically from ./classifiers (at most one per file) and can be chosen by passing their class name to the -c flag
- The path to a word embedding must be put into ./tester.ini under a name of your choice; it can then be selected by passing that name to the -m flag
- For a machine to be used as a master or worker node, ./secretFile must be initialised with random data, which is used for authentication
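To illustrate what automatic classifier detection can look like, here is a minimal sketch that collects all subclasses of a base class from a list of modules and indexes them by class name, which is the kind of lookup the -c flag needs. `BaseClassifier` and `discover_classifiers` are hypothetical names made up for this example; the actual base class in util/classifiers.py and the actual detection code may work differently.

```python
import inspect
import types


class BaseClassifier:
    """Hypothetical stand-in for the base class in util/classifiers.py."""


def discover_classifiers(modules, base=BaseClassifier):
    """Map class name -> class for every subclass of `base` defined in `modules`."""
    found = {}
    for module in modules:
        for name, obj in inspect.getmembers(module, inspect.isclass):
            # Skip the base class itself and anything that does not extend it.
            if obj is base or not issubclass(obj, base):
                continue
            # Skip classes that were only imported into the module.
            if obj.__module__ != module.__name__:
                continue
            found[name] = obj
    return found


# Build a throwaway module so the sketch runs without the real ./classifiers folder.
demo = types.ModuleType("demo_classifiers")


class DocSumSVMClassifier(BaseClassifier):
    pass


DocSumSVMClassifier.__module__ = demo.__name__
demo.DocSumSVMClassifier = DocSumSVMClassifier
demo.BaseClassifier = BaseClassifier  # imported name, should be ignored

classifiers = discover_classifiers([demo])
print(sorted(classifiers))
```

A class name given on the command line can then be resolved with a plain dictionary lookup such as `classifiers["DocSumSVMClassifier"]`.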
These are the tools available:
- ./tester.py: Can be used to train, validate and cross validate on a specific dataset. I used it mostly for testing local modifications before cross validating them
- ./nestedCrossVal.py: Can be used to run a nested cross validation on a dataset. It supports three modes:
  - "local" mode, where all computation is done in parallel on the current machine,
  - "worker" mode, where it simply waits for a tuple of hyperparameters to cross validate, and
  - "master" mode, where it distributes all tuples of hyperparameters to the workers.
  The hyperparameters to optimize must be passed as JSON after --, e.g.
./nestedCrossVal.py -v local -c DocSumSVMClassifier -m your.model --datasets Veroff -- '{"SVM_C":[1,2,3]}'
- ./crossVal.py: It's basically the same as nestedCrossVal.py but performs a simple cross validation
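As a rough illustration of the "tuples of hyperparameters" mentioned for nestedCrossVal.py, the sketch below expands a JSON grid like the one in the example command into one parameter set per combination. It assumes the grid is expanded as a Cartesian product of the value lists, which is not confirmed by this README; the function name `hyperparameter_tuples` and the second parameter `window` are hypothetical.

```python
import itertools
import json


def hyperparameter_tuples(grid_json):
    """Expand a JSON grid like '{"SVM_C":[1,2,3]}' into one dict per combination."""
    grid = json.loads(grid_json)
    names = sorted(grid)  # fixed order keeps the output deterministic
    value_lists = [grid[name] for name in names]
    # Assumption: every combination of values is tried (Cartesian product).
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


for params in hyperparameter_tuples('{"SVM_C": [1, 2, 3]}'):
    print(params)
```

With a second list in the grid, e.g. `'{"SVM_C": [1, 2], "window": [5, 10]}'`, this yields four combinations. In "master" mode, each such combination would presumably be handed to one worker.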
Furthermore, there are some additional tools in ./util:
- ./util/calculateWinterBaseline.py: Quite uninteresting, produces the Winter baseline cross validation results
- ./util/storyscraper.py: Can be used to compile your own fanfiction corpus (cannot distribute my own, due to licensing)
- ./util/tablizeJSON.py: Can be used to turn JSON results from nested cross validation into PDFs
- ./util/trainmodel.py: Can be used to train GloVe and word2vec embeddings from Wikipedia and fanfiction corpora
- ./util/weighPerAuthor.py: Can be used to weigh JSON results per author instead of per story
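One plausible reading of "per author instead of per story" is that scores are first averaged within each author and then across authors, so that prolific authors do not dominate the result. The sketch below shows that idea only. The record fields `author` and `score`, the function name `weigh_per_author` and the numbers are made up for illustration and do not reflect the actual JSON format or results of this repo.

```python
from collections import defaultdict


def weigh_per_author(story_results):
    """Average per-story scores within each author, then across authors."""
    if not story_results:
        raise ValueError("no results given")
    by_author = defaultdict(list)
    for result in story_results:
        by_author[result["author"]].append(result["score"])
    author_means = {a: sum(s) / len(s) for a, s in by_author.items()}
    overall = sum(author_means.values()) / len(author_means)
    return author_means, overall


# Illustrative data only: author "a" wrote two stories, author "b" one.
results = [
    {"author": "a", "score": 1.0},
    {"author": "a", "score": 0.0},
    {"author": "b", "score": 1.0},
]
author_means, overall = weigh_per_author(results)
print(author_means, overall)
```

In this toy example the per-story mean would be 2/3, while the per-author mean is 0.75, because the two stories by author "a" count as one unit.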