PraDa automatically downloads (only OpenML for now) and locally stores its fine selection of datasets on your computer in the HDF5 format for faster loading times.
Use this environment variable to inform PraDa where to store these HDF5 datasets:
export PRADA_DATA_DIR=/your/data/cache/directory
PraDa is currently not a PyPI package, but you can install it straight from Github (https://pip.pypa.io/en/stable/topics/vcs-support/)
pip install git+https://github.com/laudv/prada.git
List all the datasets:
import prada
prada.ALL_DATASETS
prada.ALL_DATASET_NAMES
prada.ALL_BINARY
prada.ALL_REGRESSION
prada.ALL_MULTICLASS
Load a dataset using the class name:
d = prada.Phoneme()
d.load_dataset()
d.X.shape # (5404, 5)
d.y.shape # (5404,)
Load a dataset using a name of a dataset:
d = prada.get_dataset("Phoneme")
Load a multiclass dataset and turn it into a binary dataset by comparing only two classes:
d = prada.Mnist()
d.load_dataset()
d2v4 = d.one_vs_other(2, 4)
# or
d = prada.get_dataset("Mnist[2v4]")
Similarly, turn a regression dataset into a (binary) classification dataset:
d = prada.get_dataset("WineQuality[bin]")
# or
d = prada.WineQuality()
d.load_dataset()
d.to_binary(frac_positive=0.5)
For more of these functions, have a look at the RegressionMixin
and
MulticlassMixin
mixins.
Iterate over a grid of parameters:
d = prada.Spambase()
d.load_dataset()
param_dict = {"n_estimators": [10, 20], "eta": [0.5, 0.9] }
for i, params in enumerate(d.paramgrid(**param_dict)):
print(i, params)
This prints:
0 {'n_estimators': 10, 'eta': 0.5}
1 {'n_estimators': 10, 'eta': 0.9}
2 {'n_estimators': 20, 'eta': 0.5}
3 {'n_estimators': 20, 'eta': 0.9}
Train a model for a given parameter set:
dtrain, dtest = d.train_and_test_fold(fold)
dtrain, dvalid = dtrain.train_and_test_fold(0)
# `model_class` can be any sklearn compatible classifier.
# There is built-in support for
# - rf: sklearn RandomForest
# - xgb: xgboost
# - lgb: lightgbm
model_type = "xgb" # or "rf", "lgb"
model_class = d.get_model_class(model_type)
clf, train_time = dtrain.train(model_class, params)
mtrain = dtrain.metric(clf)
mtest = dtest.metric(clf)
mvalid = dvalid.metric(clf)
d = prada.Banknote(nfolds=5, seed=5232, silent=True)
d.name() # Banknote
d.source # openml
d.openml_id # only for openml datasets
d.url # only for openml datasets
d.is_regression() # False
d.is_binary() # True
d.is_multiclass() # False
d.astype(np.float32) # cast d.X and d.y
d.minmax_normalize() # sklearn.MinMaxScaler
d.robust_normalize() # sklearn.RobustScaler
d.scale_target() # for regression problems
# Metric: RMSE for regression, Accuracy for classification
# either evaluates a given classifier on `d.X` ...
d.metric(clf: sklearn_compatible)
d.metric(at: veritas.AddTree)
# ... or just applies the relevant metric to the given values
d.metric(ytrue, ypred)
import prada
import click
@click.group()
def cli():
pass
@cli.command("my_command")
@click.argument("dname")
@click.option("-m", "--model_type", type=click.Choice(["xgb", "rf", "lgb"]),
default="rf")
@click.option("--fold", default=0)
@click.option("--nfolds", default=5)
@click.option("--relerr", default=0.01)
@click.option("--seed", default=123456)
def test_idea_cmd(dname, model_type, fold, nfolds, seed):
d = prada.get_dataset(dname, nfolds=nfolds, seed=seed)
d.load_dataset()
d.robust_normalize()
d.scale_target() # only for regression datasets
# Do what you need to do here...
if __name__ == "__main__":
cli()