
A fine selection of PRActically useful tabular DAtasets

PraDa 💎 : A fine selection of Practically useful tabular Datasets

PraDa automatically downloads its selection of datasets (currently only from OpenML) and stores them locally in the HDF5 format, so that later loads are faster.

Use this environment variable to inform PraDa where to store these HDF5 datasets:

export PRADA_DATA_DIR=/your/data/cache/directory
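The variable can also be set from Python, as long as that happens before PraDa looks it up. A minimal sketch, assuming PraDa reads `PRADA_DATA_DIR` from the process environment when it loads a dataset. The helper name `use_prada_cache` is hypothetical and not part of PraDa:

```python
import os
import tempfile
from pathlib import Path


def use_prada_cache(directory):
    """Create `directory` if needed and point PRADA_DATA_DIR at it.

    Hypothetical helper, not part of PraDa.
    """
    path = Path(directory).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["PRADA_DATA_DIR"] = str(path)
    return path


# Example: a throwaway directory; use a persistent path in practice.
cache = use_prada_cache(Path(tempfile.gettempdir()) / "prada_cache")
print(cache)
```

Call this before the first dataset is loaded, so that the download ends up in the intended directory.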


PraDa is not yet available on PyPI, but you can install it straight from GitHub using pip's VCS support (https://pip.pypa.io/en/stable/topics/vcs-support/):

pip install git+https://github.com/laudv/prada.git

Loading a dataset

List all the datasets:

import prada

Load a dataset using the class name:

d = prada.Phoneme()

d.X.shape # (5404, 5)
d.y.shape # (5404,)

Load a dataset by its name:

d = prada.get_dataset("Phoneme")

Load a multiclass dataset and turn it into a binary dataset by comparing only two classes:

d = prada.Mnist()
d2v4 = d.one_vs_other(2, 4)

# or

d = prada.get_dataset("Mnist[2v4]")
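To make the idea concrete, here is a plain NumPy sketch of what a one-vs-other reduction could look like. It is an illustration, not PraDa's implementation: the function name `one_vs_other_sketch` is hypothetical, and labelling the first class 1 and the second class 0 is an assumption.

```python
import numpy as np


def one_vs_other_sketch(X, y, class_a, class_b):
    """Keep only rows labelled class_a or class_b; relabel them 1 and 0.

    Assumption: class_a becomes the positive class. PraDa may differ.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    mask = (y == class_a) | (y == class_b)
    return X[mask], (y[mask] == class_a).astype(int)


# Toy data with labels 2, 4, and 7.
X = np.arange(12).reshape(6, 2)
y = np.array([2, 4, 7, 2, 7, 4])
X2v4, y2v4 = one_vs_other_sketch(X, y, 2, 4)
print(X2v4.shape, y2v4)
```

Rows belonging to any other class (here, 7) are dropped, so the resulting dataset is smaller than the original.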

Similarly, turn a regression dataset into a (binary) classification dataset:

d = prada.get_dataset("WineQuality[bin]")

# or

d = prada.WineQuality()

For more of these functions, have a look at the RegressionMixin and MulticlassMixin mixins.
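One common way to turn a regression target into a binary one is to threshold it. The sketch below uses the median as the default threshold. This is an assumption for illustration, and the rule PraDa applies for `[bin]` may be different. `binarize_target_sketch` is a hypothetical name.

```python
import numpy as np


def binarize_target_sketch(y, threshold=None):
    """Return 1 where y is strictly above the threshold, else 0.

    Assumption: the median is used when no threshold is given.
    """
    y = np.asarray(y, dtype=float)
    if threshold is None:
        threshold = np.median(y)
    return (y > threshold).astype(int)


quality = np.array([3.0, 5.0, 6.0, 7.0, 8.0])
labels = binarize_target_sketch(quality)
print(labels)
```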

Hyper-parameter optimization

Iterate over a grid of parameters:

d = prada.Spambase()

param_dict = {"n_estimators": [10, 20], "eta": [0.5, 0.9]}
for i, params in enumerate(d.paramgrid(**param_dict)):
    print(i, params)

This prints:

0 {'n_estimators': 10, 'eta': 0.5}
1 {'n_estimators': 10, 'eta': 0.9}
2 {'n_estimators': 20, 'eta': 0.5}
3 {'n_estimators': 20, 'eta': 0.9}
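This ordering is what a Cartesian product over the parameter lists yields. A standard-library sketch that produces the same sequence is shown here as an illustration of the idea. `paramgrid_sketch` is a hypothetical stand-in, and PraDa's `paramgrid` may be implemented differently:

```python
import itertools


def paramgrid_sketch(**param_lists):
    """Yield one dict per combination, varying the last parameter fastest."""
    names = list(param_lists)
    for values in itertools.product(*(param_lists[n] for n in names)):
        yield dict(zip(names, values))


grid = list(paramgrid_sketch(n_estimators=[10, 20], eta=[0.5, 0.9]))
for i, params in enumerate(grid):
    print(i, params)
```

The grid size is the product of the list lengths, so it grows quickly as parameters are added.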

Train a model for a given parameter set:

fold = 0 # index of the held-out test fold
dtrain, dtest = d.train_and_test_fold(fold)
dtrain, dvalid = dtrain.train_and_test_fold(0)

# `model_class` can be any sklearn compatible classifier.
# There is built-in support for
#   - rf:  sklearn RandomForest
#   - xgb: xgboost
#   - lgb: lightgbm
model_type = "xgb" # or "rf", "lgb"
model_class = d.get_model_class(model_type)
clf, train_time = dtrain.train(model_class, params)

mtrain = dtrain.metric(clf)
mtest  = dtest.metric(clf)
mvalid = dvalid.metric(clf)
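The two calls to `train_and_test_fold` give a three-way split: the outer fold separates the test data, and fold 0 of the remaining data is held out for validation. A standard-library sketch of that nesting on row indices follows. The shuffling and chunking scheme is an assumption, not PraDa's actual splitting logic, and `kfold_split_sketch` is a hypothetical name:

```python
import random


def kfold_split_sketch(indices, fold, nfolds=5, seed=0):
    """Shuffle indices, cut them into nfolds chunks, hold out chunk `fold`."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    chunks = [shuffled[i::nfolds] for i in range(nfolds)]
    test = chunks[fold]
    train = [idx for i, chunk in enumerate(chunks) if i != fold for idx in chunk]
    return train, test


rows = range(100)
train_idx, test_idx = kfold_split_sketch(rows, fold=2)        # outer split
train_idx, valid_idx = kfold_split_sketch(train_idx, fold=0)  # inner split
print(len(train_idx), len(valid_idx), len(test_idx))
```

With this layout, parameters are compared on the validation part, and the test part is only used to report the final score.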

Utility functions

d = prada.Banknote(nfolds=5, seed=5232, silent=True)

d.name() # Banknote

d.source    # openml
d.openml_id # only for openml datasets
d.url       # only for openml datasets

d.is_regression() # False
d.is_binary()     # True
d.is_multiclass() # False
d.astype(np.float32) # cast d.X and d.y (needs `import numpy as np`)
d.minmax_normalize() # sklearn.MinMaxScaler
d.robust_normalize() # sklearn.RobustScaler
d.scale_target() # for regression problems

# Metric: RMSE for regression, accuracy for classification.
# Either evaluates a given model on `d.X` ...
d.metric(clf) # clf: any sklearn-compatible model
d.metric(at)  # at: a veritas.AddTree
# ... or just applies the relevant metric to the given values
d.metric(ytrue, ypred)
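For reference, the two metrics named above can be computed as follows. This is a NumPy sketch with a hypothetical `metric_sketch` function. The explicit `is_regression` flag is an assumption about how the choice between the two is made:

```python
import numpy as np


def metric_sketch(ytrue, ypred, is_regression):
    """RMSE for regression targets, accuracy for class labels."""
    ytrue = np.asarray(ytrue)
    ypred = np.asarray(ypred)
    if is_regression:
        return float(np.sqrt(np.mean((ytrue - ypred) ** 2)))
    return float(np.mean(ytrue == ypred))


rmse = metric_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], is_regression=True)
acc = metric_sketch([0, 1, 1, 0], [0, 1, 0, 0], is_regression=False)
print(rmse, acc)
```

Note that the two metrics point in opposite directions: lower RMSE is better, while higher accuracy is better.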

Using it in a click command line interface

import prada
import click

# Top-level command group; subcommands are attached to it below.
@click.group()
def cli():
    pass

@cli.command()
@click.argument("dname")
@click.option("-m", "--model_type", type=click.Choice(["xgb", "rf", "lgb"]))
@click.option("--fold", default=0)
@click.option("--nfolds", default=5)
@click.option("--relerr", default=0.01)
@click.option("--seed", default=123456)
def test_idea_cmd(dname, model_type, fold, nfolds, relerr, seed):
    d = prada.get_dataset(dname, nfolds=nfolds, seed=seed)
    d.scale_target() # only for regression datasets

    # Do what you need to do here...

if __name__ == "__main__":
    cli()