This is a preliminary repo for classifiers that are restricted to making predictions as a weighted average of the training data, specifically for text datasets.

Requirements:
- python3
- pytorch 0.4.0
- torchvision
- numpy
- spacy
- scikit-learn
In addition to setting up a Python environment with the packages listed above, these models assume access to GloVe embeddings, which can be downloaded from https://nlp.stanford.edu/projects/glove/. By default, the models will look for the embeddings in `data/glove/`, but a different location can be specified at run time.
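For example, assuming a hypothetical `--glove-file` option (run `python run.py -h` to confirm the actual flag name and default):

`python run.py --dataset stackoverflow --glove-file /path/to/glove.840B.300d.txt`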
To train a model using one of the pre-specified datasets, such as StackOverflow, use:

`python run.py --dataset stackoverflow`

This will download the dataset to `data/stackoverflow/raw/`, preprocess it, train a baseline CNN model, predict on the test data, and save the output to `data/temp/`. The output directory will contain files for the train, dev, and test data, each of which is an `.npz` file containing labels, predictions, and predicted probabilities.
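To see what an output file contains, one option is to load it directly with numpy. A minimal sketch (the exact array names stored in each `.npz` are not specified here, so the snippet simply lists them):

```python
import numpy as np

# Minimal sketch: open one of the saved output files and list its contents.
# The array names stored inside the .npz are not documented above,
# so print output.files to see what was actually saved.
output = np.load("data/temp/test.npz")
print(output.files)  # names of the stored arrays (labels, predictions, etc.)
for name in output.files:
    print(name, output[name].shape)
```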
To train a weighted averaging model, add `--model act`.
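For example, combining this with the dataset option above:

`python run.py --dataset stackoverflow --model act`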
To train a model on a dataset that has not been prespecified, create a directory called `data/[name]/raw/`, where `[name]` is the name of your dataset. In that directory, create files called `train.jsonlist` and `test.jsonlist`. Each of those files should contain one document per line. Each line should be a JSON object with at least two fields: "text" and "label". For example, the first line of a file could be the following JSON object:

`{"text": "This is a positive document", "label": "positive"}`
To train a model on this data, use:

`python run.py --dataset [name]`

again replacing `[name]` with the name of your dataset as above. This will load the data, tokenize the text, and then proceed as above.
To choose the size of the output layer for the averaging classifier, use `--z-dim [dz]`, where `[dz]` is the desired dimensionality. To train on a GPU, include the option `--cuda`. To choose a different output directory, use `--output-dir [output-dir]`, where `[output-dir]` is the desired target directory.
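Putting these options together, a full training command might look like the following (the specific values are illustrative):

`python run.py --dataset [name] --model act --z-dim 10 --cuda --output-dir data/my_output/`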
For additional options, such as model size and optimization choices, run:
`python run.py -h`
The `eval` directory contains a number of scripts to help with evaluation. For example, to evaluate the calibration (and accuracy) of the predictions on test data in the `data/temp/` directory, use:

`python -m eval.eval_calibration data/temp/test.npz`
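If you want to see roughly what such a calibration check involves, a hand-rolled sketch is below. The array names `labels` and `pred_probs` are assumptions, not the repo's actual keys; adjust them to whatever `output.files` reports:

```python
import numpy as np

# Sketch of a basic calibration check done by hand, independent of the
# eval scripts. The array names below are assumptions to be verified.
output = np.load("data/temp/test.npz")
labels = output["labels"]      # assumed shape: (n,) integer class labels
probs = output["pred_probs"]   # assumed shape: (n, n_classes) probabilities

preds = probs.argmax(axis=1)
confidence = probs.max(axis=1)
print("accuracy:", (preds == labels).mean())

# Expected calibration error over 10 equal-width confidence bins:
# a weighted average of |accuracy - mean confidence| per bin.
bins = np.linspace(0.0, 1.0, 11)
ece = 0.0
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence > lo) & (confidence <= hi)
    if mask.any():
        gap = abs((preds[mask] == labels[mask]).mean() - confidence[mask].mean())
        ece += mask.mean() * gap
print("ECE:", ece)
```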
To inspect the calibration and confidence values, and correctness at a given epsilon value, say 0.1, use:

`python -m eval.eval_conformal data/temp --eps 0.1`

To evaluate these using the sum of weights rather than the probabilities, add `--weights`.