Snape

Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds 'realism' features such as complex formatting, varying scales, categorical variables, and missing values.

Motivation

Snape was primarily created for academic and educational settings. It has been used to create homework datasets that are unique per student and per assignment. It has also been used to create class-wide assessments in conjunction with 'Kaggle In the Classroom.'

Other users have suggested non-academic use cases as well, including 'interview screening problems,' model comparison, etc.

Installation

Via GitHub

git clone https://github.com/mbernico/snape.git
cd snape
python setup.py install

Via pip

Coming Soon...

Quick Start

Snape can run either as a Python module or as a command-line application.

Command Line Usage

Creating a Dataset

From the main directory in the git repo:

python snape/make_dataset.py -c example/config_classification.json

This uses the configuration file example/config_classification.json to create an artificial dataset called 'my_dataset' (the name is specified in the JSON config; more on this later).

The dataset will consist of three files:

  • my_dataset_train.csv (80% of the artificial dataset with all dependent and independent variables)
  • my_dataset_test.csv (20% of the artificial dataset with only the independent variables present)
  • my_dataset_testkey.csv (the same 20% as _test, including the dependent variables)
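
The three-file layout can be illustrated with a small standard-library sketch. This is not Snape's implementation: the function split_and_write, the column names x1, x2 and y, and the fixed seed are all hypothetical, chosen only to show an 80/20 split where the test file withholds the target column and the testkey keeps it.

```python
import csv
import os
import random
import tempfile


def split_and_write(rows, header, target, out_dir, name, test_frac=0.2, seed=0):
    """Write <name>_train.csv, <name>_test.csv and <name>_testkey.csv (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # Hold out test_frac of the rows; the rest becomes the training set.
    n_test = int(round(len(shuffled) * test_frac))
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]

    target_idx = header.index(target)

    def drop_target(row):
        # Remove the dependent variable so students cannot see the answers.
        return [v for i, v in enumerate(row) if i != target_idx]

    outputs = {
        name + "_train.csv": (header, train_rows),
        name + "_test.csv": (drop_target(header), [drop_target(r) for r in test_rows]),
        name + "_testkey.csv": (header, test_rows),
    }
    for filename, (cols, data) in outputs.items():
        with open(os.path.join(out_dir, filename), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(cols)
            writer.writerows(data)

    # Report how many data rows ended up in each file.
    return {filename: len(data) for filename, (cols, data) in outputs.items()}


# Toy data: two features and a binary target named "y" (assumed names).
header = ["x1", "x2", "y"]
rows = [[i, i * 2, i % 2] for i in range(100)]
out_dir = tempfile.mkdtemp()
counts = split_and_write(rows, header, "y", out_dir, "my_dataset")
print(counts)
```

The train and testkey files share the full header, while the test file has one column fewer, which is what lets the testkey act as the answer sheet.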

Note that if a star schema is generated, additional CSV files are written, one extra file per dimension. Only the main 'fact table' dataset is split into train and test files.

The train and test files can be given to a student. The student can respond with a file of predictions, which can be scored against the testkey as follows:

Scoring a Dataset

python snape/score_dataset.py -p example/student_predictions.csv -k example/student_testkey.csv

Snape's score_dataset.py will attempt to detect the problem type and then score it, printing some metrics:

Problem Type Detection: binary
---Binary Classification Score---
             precision    recall  f1-score   support

          0       0.81      0.99      0.89      1601
          1       0.50      0.06      0.11       399

avg / total       0.75      0.80      0.73      2000
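
To help read this report, the sketch below recomputes the 'avg / total' row from the two per-class rows. It assumes that row is the support-weighted average of the per-class metrics, which is how older scikit-learn versions of classification_report labeled it. The inputs are the rounded values printed above, so this is a consistency check and not how Snape computes the score.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
report = {
    0: (0.81, 0.99, 0.89, 1601),
    1: (0.50, 0.06, 0.11, 399),
}

total = sum(row[3] for row in report.values())


def weighted(col):
    """Support-weighted average of one metric column across classes."""
    return sum(row[col] * row[3] for row in report.values()) / total


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


avg_precision = weighted(0)
avg_recall = weighted(1)
avg_f1 = weighted(2)

print("avg / total  %.2f  %.2f  %.2f  %d" % (avg_precision, avg_recall, avg_f1, total))
# f1 for the minority class, recomputed from its rounded precision and recall.
print("class 1 f1   %.2f" % f1(0.50, 0.06))
```

The low recall for class 1 pulls its f1 down sharply, yet the weighted averages stay high because class 0 holds about 80% of the support. That is worth keeping in mind when grading imbalanced datasets.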

Python Module Usage

Creating a Dataset

from snape.make_dataset import make_dataset

# configuration json examples can be found in doc
conf = {
    "type": "classification",
    "n_classes": 2,
    "n_samples": 1000,
    "n_features": 10,
    "out_path": "./",
    "output": "my_dataset",
    "n_informative": 3,
    "n_duplicate": 0,
    "n_redundant": 0,
    "n_clusters": 2,
    "weights": [0.8, 0.2],
    "pct_missing": 0.00,
    "insert_dollar": "Yes",
    "insert_percent": "Yes",
    "n_categorical": 0,
    "star_schema": "No",
    "label_list": []
}

make_dataset(config=conf)
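
With "insert_dollar" and "insert_percent" enabled, a student will typically have to turn formatted strings back into numbers before modeling. The helper below is a minimal sketch of that cleanup step. The exact string formats Snape emits are not documented here, so the patterns handled ("$1,234.50", "12.5%", blank cells for missing values) are assumptions, and to_number is a hypothetical name, not part of Snape.

```python
def to_number(value):
    """Convert strings such as "$1,234.50" or "12.5%" to float.

    Assumed formats only; blank or NaN-like cells are returned as None.
    The percent sign is stripped without rescaling, so "12.5%" becomes 12.5.
    """
    text = str(value).strip()
    if text == "" or text.lower() in ("nan", "na", "null", "none"):
        return None
    # Drop currency symbols, thousands separators, and percent signs.
    cleaned = text.replace("$", "").replace(",", "").replace("%", "")
    return float(cleaned)


# Example column mixing the assumed formats with a missing value.
raw_column = ["$1,234.50", "12.5%", "", "-$5.00", "42"]
clean_column = [to_number(v) for v in raw_column]
print(clean_column)
```

Returning None for missing cells keeps the decision about imputation separate from the parsing step.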

Scoring a Dataset

from snape.score_dataset import score_dataset

# a dataset's testkey can be compared to a prediction file using score_dataset()
results = score_dataset(y_file="student_testkey.csv", y_hat_file="student_predictions.csv")
# results is a tuple of (a_primary_metric, classification_report)
print("AUC = " + str(results[0]))
print(results[1])

Dataset Generation Config

  1. Classification JSON
  2. Regression JSON

Why Snape?

Snape is primarily used for creating complex datasets that challenge students and teach defense against the dark arts of machine learning. :)