feature_transform

Build ColumnTransformers (Scikit or DaskML) for feature transformation by specifying configs.

For quickly building PyTorch models, see also TorchArc.

Installation

pip install feature_transform

Installing this will also install Scikit Learn, but if you need parallelization, install Dask ML separately:

pip install dask-ml

Usage

The ColumnTransformer class of Scikit / DaskML allows us to build a complex pipeline of feature preprocessors/transformers that takes dataframe as input and outputs numpy arrays. However, using it requires maintaining Python code.

This project started with the vision of building the entire feature transformation pipeline by just specifying what preprocessors to apply to a dataframe's column.

For example, take the iris dataset with columns: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), target. We want the first 4 columns to be the features for our input x, where each feature goes through a StandardScaler; and target to be the feature of our output y, where it is one-hot encoded. Then, use this directly to fit_transform the iris dataframe and obtain numpy arrays xs, ys. Here's the code:

from feature_transform import transform
from sklearn import datasets
import pandas as pd


# specify transform for each feature
spec = {
    'dataset': {
        'transform': {'module': 'sklearn', 'n_jobs': 1}
    },
    'transform': {
        'x': { # the "mode"
            'sepal length (cm)': {'StandardScaler': None}, # the column name and its {preprocessor: kwargs, ...}
            'sepal width (cm)': {'StandardScaler': None},
            'petal length (cm)': {'StandardScaler': None},
            'petal width (cm)': {'StandardScaler': None},
        },
        'y': {
            'target': {'OneHotEncoder': {'sparse': False, 'handle_unknown': 'ignore'}}
        }
    }
}

# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

# to reload the fitted transformers for validation/test, specify stage='validate' or 'test'
val_df = data_df.copy()
mode2val_data = transform.fit_transform(spec, stage='validate', df=val_df)
val_xs, val_ys = mode2val_data['x'], mode2val_data['y']

# artifacts to get the column transformers and transformed names directly
artifacts = transform.get_artifacts(spec)
artifacts['mode2col_transfmr']
# {'x': ColumnTransformer(n_jobs=1, sparse_threshold=0, transformers=[('sepal length (cm)', Pipeline(steps=[('standardscaler',...

artifacts['mode2transformed_names']
# {'x': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
#  'y': ['target_0', 'target_1', 'target_2']}

What happens in the background is as follows:

for each mode in spec.transform
- for each column in mode, create a pipeline of [preprocessor(**kwargs)], and compose them into a ColumnTransformer for the mode.
- during fit_transform, each mode runs its ColumnTransformer.fit_transform
- then it saves the fitted ColumnTransformer to ./data/{hash}-{mode}-col_transfmr.pkl.
- these filenames will be logged. These files are the ones loaded in transform.get_artifacts for uses such as test/validation.

Using YAML config

The goal of this library is to make feature transform configuration, so let's do the same as above, but with a YAML config file. The spec format is:

dataset:
  transform:
    module: {str} # options: 'sklearn' (serial-row) or 'dask_ml' (parallel-row)
    n_jobs: {null|int} # parallelization; -1 to use all cores
transform:
  {mode}:
    {column}:
      {preprocessor}: {null|kwargs} # optional kwargs for preprocessor
      {preprocessor}: {null|kwargs}
      ...

The {preprocessor} value can be any of the preprocessor classes Scikit or DaskML. Additional custom ones are also registered in feature_transform/transform.py.

For example, the earlier spec can be rewritten in YAML as:

# transform.yaml
dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  x:
    sepal length (cm):
      StandardScaler:
    sepal width (cm):
      StandardScaler:
    petal length (cm):
      StandardScaler:
    petal width (cm):
      StandardScaler:
  y:
    target:
      OneHotEncoder:
        sparse: false
        handle_unknown: ignore

Now, our code simplifies to:

from feature_transform import transform, util
from sklearn import datasets
import pandas as pd


# convenient method to read YAML
spec = util.read('transform.yaml')
# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

# to reload the fitted transformers for validation/test, specify stage='validate' or 'test'
val_df = data_df.copy()
mode2val_data = transform.fit_transform(spec, stage='validate', df=val_df)
val_xs, val_ys = mode2val_data['x'], mode2val_data['y']

Chain Preprocessors

To chain multiple preprocessors, simply add more steps:

dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  x:
    sepal length (cm):
      Log1pScaler: # custom preprocessor for np.log1p
      StandardScaler:
    sepal width (cm):
      Clipper: # custom preprocessor to clip values
        a_min: 0
        a_max: 10
      StandardScaler:
    petal length (cm):
      StandardScaler:
    petal width (cm):
      StandardScaler:
  y:
    target:
      OneHotEncoder:
        sparse: false
        handle_unknown: ignore

Specify any module

By default the config refers to classes in the preprocessing module of sklearn/dask-ml. Use dot-notation to specify other modules:

dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  x:
    a_float_column:
      StandardScaler:
    a_column_with_dict_values:
      feature_extraction.DictVectorizer:
    a_column_with_na:
      StandardScaler:
      impute.SimpleImputer: # handle na values
        strategy: constant
        fill_value: -1
  y:
    a_target_column:
      Identity:

Specify any modes

The modes can be any names other than x, y:

dataset:
  transform:
    module: sklearn
    n_jobs: null
transform:
  foo:
    column_foo_1:
      StandardScaler:
    column_foo_2:
      Log1pScaler:
      StandardScaler:
  bar:
    column_bar_1:
      OneHotEncoder:
  baz:
    column_baz_1:
      Identity:

Parallelization

NOTE run pip install dask-ml first.

dataset:
  transform:
    module: dask_ml
    n_jobs: -1 # use all cores
transform:
  # ...

ML Examples

PyTorch DataLoader

from feature_transform import transform, util
from sklearn import datasets
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import torch


spec = util.read('transform.yaml')
# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

train_dataset = TensorDataset(torch.from_numpy(xs), torch.from_numpy(ys)) # create your datset
train_dataloader = DataLoader(train_dataset) # create your dataloader

# suppose this is test/validation set; use stage='validate' or stage='test' to transform
val_df = data_df.copy()
mode2val_data = transform.fit_transform(spec, stage='validate', df=val_df)
val_xs, val_ys = mode2val_data['x'], mode2val_data['y']
val_dataset = TensorDataset(torch.from_numpy(val_xs), torch.from_numpy(val_ys))
val_dataloader = DataLoader(val_dataset) # create your dataloader

Scikit Learn example

from feature_transform import transform, util
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeClassifier
import pandas as pd


spec = util.read('transform.yaml')
# load iris dataframe
data_df = pd.concat(datasets.load_iris(return_X_y=True, as_frame=True), axis=1)
# transform into numpy arrays ready for model
mode2data = transform.fit_transform(spec, stage='fit', df=data_df)
xs, ys = mode2data['x'], mode2data['y']

# train model
model = DecisionTreeClassifier(max_depth = 3, random_state = 1)
model.fit(xs, ys)
pred_ys = model.predict(xs)
print(f'train accuracy: {metrics.accuracy_score(pred_ys, ys):.3f}')
# train accuracy: 0.973

# suppose this is validation/test data, we use stage='validate' or 'test
test_df = data_df.copy()
mode2test_data = transform.fit_transform(spec, stage='test', df=test_df)
test_xs, test_ys = mode2val_data['x'], mode2val_data['y']
pred_ys = model.predict(test_xs)
print(f'test accuracy: {metrics.accuracy_score(pred_ys, test_ys):.3f}')
# test accuracy: 0.973

Development

Setup

# install the dev dependencies
bin/setup
# activate Conda environment
conda activate transform

Unit Tests

python setup.py test

kengz/feature_transform