
Machine Learning Datasets

Machine Learning Datasets (mlds) is a repo for downloading, preprocessing, and numpy-ifying popular machine learning datasets. Across projects, I commonly found myself rewriting the same lines of code to standardize, normalize, or other-ize data, encode categorical variables, and parse out subsets of features, among other transformations. This repo provides a simple interface to download, parse, transform, and clean datasets via flexible command-line arguments. All of the information you need to start using this repo is contained within this one README, ordered by complexity (no need to parse through any ReadTheDocs documentation).

Datasets

Integrating a data source into mlds requires writing a simple module within the mlds.downloaders package to download the dataset (by implementing a retrieve function). Afterwards, add the module to the package initializer and the dataset is ready to be processed by mlds. Notably, the only requirement is that retrieve should return a dict containing dataset partitions as keys and pandas DataFrames as values. At this time, the following datasets are supported:
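As a sketch, a minimal downloader module might look like the following. Only the retrieve contract comes from the repo (partition names as keys, pandas DataFrames as values); the dataset contents, column names, and directory argument here are placeholders, not a real data source:

```python
import pandas as pd


def retrieve(directory="/tmp/mlds/toy"):
    """Return dataset partitions as {partition_name: DataFrame}.

    A real implementation would download files into `directory` and
    parse them; here the frames are built inline as placeholders.
    """
    train = pd.DataFrame(
        {"feature_a": [0.1, 0.7, 0.3], "feature_b": [1, 0, 1], "label": [0, 1, 0]}
    )
    test = pd.DataFrame({"feature_a": [0.5], "feature_b": [1], "label": [1]})
    return {"train": train, "test": test}
```

Once such a module is registered in the package initializer, mlds can process the dataset like any other.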

Quick Start

This repo was designed to be interoperable with the following models and attacks repos. I recommend installing an editable version of this repo via pip install -e . from the repo root. Afterwards, you can download, rescale, and save MNIST so that features are in [0, 1] (via a custom UniformScaler transformer, which scales all features based on the maximum and minimum feature values observed across the training set), with the dataset referenced as mnist, via:

mlds mnist -f all UniformScaler --filename mnist

You can download, rescale, and save NSL-KDD so that categorical features are one-hot encoded (via scikit-learn OneHotEncoder) while the remaining features are individually rescaled to [0, 1] (via scikit-learn MinMaxScaler; all can be used as an alias to select all features except those that are to be one-hot encoded), with numeric label encoding (via scikit-learn LabelEncoder). When --filename is not specified, it defaults to the dataset name concatenated with the transformations, so the dataset is referenced as nslkdd_OneHotEncoder_MinMaxScaler_LabelEncoder:

mlds nslkdd -f protocol_type,service,flag OneHotEncoder -f all MinMaxScaler -l LabelEncoder

Afterwards, import the dataset filename to load it:

>>> import mlds
>>> from mlds import mnist
>>> mnist
mnist(samples=(60000, 10000), features=(784, 784), classes=(10, 10), partitions=(train, test), transformations=(UniformScaler), version=c88b3d6)
>>> mnist.train
train(samples=60000, features=784, classes=10)
>>> mnist.train.data
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
>>> mnist.train.labels
array([4., 1., 0., ..., 6., 1., 5.], dtype=float32)

Other uses can be found in the examples directory.

Advanced Usage

Below are descriptions of some of the more subtle controls within this repo and complex use cases.

  • API: Should you wish to interface with mlds outside of the command line, the main entry point for the repo is the process function within the datasets module. Its docstring contains all of the necessary details surrounding arguments and their required types. An example of interfacing with mlds in this way can be found in the dnn_prepare example script.

  • Cleaning: When a dataset is processed from the command line, it can optionally be cleaned by passing --destupefy as an argument. This applies one final transformation, after all others, via the Destupefier: an experimental transformer that removes duplicate columns, duplicate rows, and single-valued columns. This is particularly useful when exploring data reduction techniques that may produce a large number of irrelevant features or duplicate samples.
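
The Destupefier lives in this repo; as a rough pandas sketch of the same three cleanups (the order of operations and edge-case handling here are assumptions, not the repo's implementation):

```python
import pandas as pd


def destupefy(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, duplicate columns, and single-valued columns."""
    df = df.drop_duplicates()            # duplicate rows
    df = df.loc[:, ~df.T.duplicated()]   # duplicate columns (keep first)
    df = df.loc[:, df.nunique() > 1]     # single-valued columns
    return df
```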

  • Downloading: If you wish to add a dataset that is not easily retrievable through another library (e.g., TensorFlow Datasets, Torchvision Datasets, pandas.read_csv), then you may find the download function within mlds.downloaders helpful. Specifically, given a tuple of URLs, mlds.downloaders.download leverages the Requests library to return a dictionary containing each requested resource as a bytes object.
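
A stdlib sketch of such a helper (the repo uses Requests; here the signature, the URL-keyed result, and the injectable fetcher are all assumptions, the latter so the function can be exercised without network access):

```python
from urllib.request import urlopen


def download(urls, fetch=None):
    """Map each URL in `urls` to the raw bytes of that resource."""
    if fetch is None:
        # default fetcher: a plain HTTP(S) GET via the standard library
        fetch = lambda url: urlopen(url).read()
    return {url: fetch(url) for url in urls}
```

Injecting fetch also makes the helper easy to unit test with canned responses instead of live requests.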

  • Partitions: Following best practices, transformations are fit to the training set and then applied to all other partitions. A partition is considered a training set if its key is "train" in the data dictionary returned by dataset downloaders (this is commonly the case when downloading datasets through other frameworks such as Torchvision Datasets and TensorFlow Datasets). If the data dictionary does not contain this key, then each partition is fit and transformed separately.
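
This fit-on-train, transform-everywhere discipline is just standard scikit-learn usage, e.g.:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler().fit(train)      # statistics come from train only
train_scaled = scaler.transform(train)  # [[0.0], [1.0]]
test_scaled = scaler.transform(test)    # test values can exceed [0, 1]
```

Test values outside the training range map outside [0, 1] (here, 20 maps to 2.0), which is the expected behavior when statistics are fit on the training set alone.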

  • Transformations: Currently, the following data transformations are supported:

    Adding custom transformations requires inheriting from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and implementing fit, set_output, and transform methods. Alternatively, transformations from other libraries can simply be aliased into the transformations module.
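
A minimal custom transformer along these lines (this UniformScaler is a sketch written from the description in the Quick Start, not the repo's implementation; with recent scikit-learn versions, set_output is inherited from TransformerMixin):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class UniformScaler(TransformerMixin, BaseEstimator):
    """Scale all features jointly by the global min/max seen during fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min()  # minimum over every feature and sample
        self.max_ = X.max()  # maximum over every feature and sample
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.min_) / (self.max_ - self.min_)
```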

Repo Overview

This repo was designed to make it easy to download and manipulate data from a variety of different sources (which may have many deficiencies) with a suite of transformations. The vast majority of the legwork is done within the mlds.datasets.process method; once datasets are downloaded via their downloaders, data transformations are composed and applied via ColumnTransformer. As described above, if "train" is a key in the dictionary representing the dataset, then the transformers are fit only to this partition.
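
The composition step can be pictured with plain scikit-learn (the column names here are hypothetical, and sparse_threshold=0 merely forces a dense result):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X = pd.DataFrame({"protocol": ["tcp", "udp", "tcp"], "duration": [0.0, 5.0, 10.0]})

transformer = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), ["protocol"]),  # categorical -> one-hot
        ("scale", MinMaxScaler(), ["duration"]),    # numeric -> [0, 1]
    ],
    sparse_threshold=0,  # always return a dense array
)
out = transformer.fit_transform(X)  # shape (3, 3): two one-hot columns + one scaled
```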

Notably, the order of features is preserved, post-transformation (including expanding categorical features with one-hot vectors), and a set of metadata is saved with the transformed dataset, including: the names of the partitions, the transformations that were applied, the current abbreviated commit hash of the repo, the number of samples, the number of features, the number of classes, one-hot mappings (if applicable), and class mappings (if a label encoding scheme was applied). Metadata that pertains to the entire dataset (such as the transformations that were applied) is saved as a dictionary attribute for Dataset objects, while partition-specific information (such as the number of samples) is saved as an attribute for Partition objects (which are set as attributes in Dataset objects).