Denoise AutoEncoder For Tabular Data

Overview

Denoise AutoEncoder (DAE)

A DAE is an autoencoder trained on a denoising task: the model takes partially corrupted input data and outputs the cleaned data.

Through the denoising task, the model learns the input distribution and produces latent representations that are robust to corruption. The latent representations extracted from the model can be useful for a variety of downstream tasks. One can:
1. Use the latent representations to train supervised ML models, which makes the DAE a vehicle for automatic feature engineering.
2. Use the latent representations for unsupervised tasks such as similarity queries or clustering.
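As an illustration of the similarity-query use case, here is a minimal sketch that ranks rows by cosine similarity in latent space. It assumes the latent representations are available as a 2-D array with one row per sample; the helper name `top_k_similar` is hypothetical and not part of this package.

```python
import numpy as np


def top_k_similar(latent, query_index, k=5):
    """Return indices of the k rows most similar to row `query_index`.

    Hypothetical helper: `latent` is assumed to be a 2-D array of shape
    (n_samples, n_features), e.g. latent representations from a trained DAE.
    """
    latent = np.asarray(latent, dtype=float)
    # Normalize rows to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    unit = latent / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_index]
    # Exclude the query row itself from the ranking.
    sims[query_index] = -np.inf
    k = min(k, len(latent) - 1)
    return np.argsort(-sims)[:k]
```

Cosine similarity is only one reasonable choice here; Euclidean distance or an approximate nearest-neighbor index may suit larger datasets better.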

Applying Denoise AutoEncoder to Tabular data

To train a DAE on tabular data, the most important piece is the noise generator. The noise that makes sense and is most effective is swap noise: each value in the training data may be replaced, with some probability, by a random value from the same column.
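The idea of swap noise can be sketched in a few lines of NumPy. This is a simplified illustration, not the package's own generator: the function name `swap_noise`, its signature, and the use of a single probability for all columns are assumptions made for the example.

```python
import numpy as np


def swap_noise(X, proba=0.15, rng=None):
    """Corrupt a 2-D array by swapping cells with values from the same column.

    Illustrative sketch only (hypothetical name and signature).
    Returns the corrupted array and the boolean mask of cells selected for swapping.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    # Select each cell for corruption independently with probability `proba`.
    mask = rng.random(X.shape) < proba
    # For every cell, draw a random donor row; the column stays the same.
    donor_rows = rng.integers(0, n_rows, size=X.shape)
    col_idx = np.broadcast_to(np.arange(n_cols), X.shape)
    swapped = X[donor_rows, col_idx]
    # Keep the original value wherever the cell was not selected.
    noisy = np.where(mask, swapped, X)
    return noisy, mask
```

Because replacement values are drawn from the same column, each corrupted cell still holds a value that actually occurs for that feature, which is what distinguishes swap noise from, say, additive Gaussian noise. Note that a selected cell can receive its own original value if the donor row happens to hold the same value.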

What's included

This package implements:
1. A swap noise generator.
2. A dataframe parser that converts an arbitrary pandas DataFrame to NumPy arrays.
3. A DAE network constructor with configurable body blocks.
4. A DAE training function.
5. A scikit-learn-style .fit / .transform API.
6. Save and load support for the scikit-learn-style model.

Installation

tabular_dae is built with PyTorch. First install the dependencies listed in requirements.txt, then install the package using pip:

# download the requirements.txt file
pip install -r requirements.txt
pip install git+https://github.com/ryancheunggit/tabular_dae

Quickstart

import pandas as pd
from tabular_dae import DAE


# read data
df = pd.read_csv('<path-to-csv-file>')  # replace the placeholder with your CSV path

# initialize a dae model
dae = DAE(
    body_network='deepstack',
    body_network_cfg=dict(hidden_size=1024),
    swap_noise_probas=.15,
    device='cuda',
)

# fit the model
dae.fit(df, verbose=1, optimizer_params={'lr': 3e-4})

# extract latent representation with the model
latent = dae.transform(df)

Credit

@software{tabular_dae2021nielseniq,
  author = {Ren Zhang},
  title = {Denoise AutoEncoder for Tabular Data},
  url = {https://github.com/ryancheunggit/tabular_dae},
  version = {0.2},
  year = {2021},
}