[Nature-SR'22] DINI: Data Imputation using Neural Inversion

DINI: Data Imputation using Neural Inversion for Edge Applications

DINI is a tool for imputing tabular multi-input/multi-output data whose features may be continuous, categorical, or a combination of both. It takes in data with missing values and iteratively imputes them while training a surrogate model that can be leveraged for downstream tasks. This enables machine learning on corrupted or missing data with state-of-the-art imputation. DINI works with any dataset and any PyTorch model.

Environment setup

Clone this repository

git clone --recurse-submodules https://github.com/jha-lab/dini.git
cd dini

Set up Python environment

The Python environment setup is based on conda. The script below creates a new environment named dini, or updates an existing one, on the macOS-arm64 platform:

source setup/env_step.sh

For any other platform, you can use the environment files. For pip installation:

pip install --requirement setup/requirements.txt

For conda installation:

conda env create --file setup/environment.yaml
conda activate dini

Replicating results

To generate corrupt data:

python3 corrupt.py --dataset <dataset> --strategy <strategy>

where <dataset> is one of breast, diabetes, diamonds, energy, flights, or yacht, and <strategy> is one of MCAR, MAR, MNAR, MSAR, or MPAR.
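To illustrate what a corruption strategy does, here is a minimal sketch of MCAR (missing completely at random) masking, where every cell is dropped independently with the same probability. This is not the implementation in corrupt.py; the function name mcar_mask and the missing_rate and seed parameters are hypothetical and chosen only for this example.

```python
import numpy as np


def mcar_mask(data: np.ndarray, missing_rate: float, seed: int = 0) -> np.ndarray:
    """Return a copy of `data` in which each cell is set to NaN
    independently with probability `missing_rate` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # True marks a cell that will be treated as missing.
    mask = rng.random(data.shape) < missing_rate
    corrupted = data.astype(float)  # astype returns a copy by default
    corrupted[mask] = np.nan
    return corrupted


X = np.arange(20, dtype=float).reshape(5, 4)
X_corrupt = mcar_mask(X, missing_rate=0.3)
print("fraction missing:", np.isnan(X_corrupt).mean())
```

Because the mask does not depend on the data values, the observed cells stay unchanged and the missingness carries no information about the features. That independence is what separates MCAR from strategies such as MAR and MNAR.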

To run the DINI model:

python3 dini.py --model <model> --dataset <dataset> --retrain

where <model> is one of FCN, FCN2, LSTM2, or TXF2. The one used in the paper is FCN2. To model uncertainties using an MC dropout layer, use the flag --model_unc. You can also set the fraction to start imputing on using --impute_fraction <fraction>, where <fraction> is a number between 0 and 1 (see Table 3 in the paper).
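For readers unfamiliar with MC dropout, the general technique can be sketched as follows: dropout stays active at prediction time, and the spread over several stochastic forward passes serves as an uncertainty estimate. This is a generic illustration, not the code behind --model_unc; the TinyRegressor class, the mc_dropout_predict function, and all layer sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn


class TinyRegressor(nn.Module):
    """Hypothetical two-layer network with a dropout layer."""

    def __init__(self, n_in: int, n_out: int, p: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, 16),
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(16, n_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    """Run `n_samples` stochastic forward passes with dropout enabled and
    return the per-output mean and standard deviation."""
    model.train()  # train mode keeps dropout active during prediction
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    return samples.mean(dim=0), samples.std(dim=0)


torch.manual_seed(0)
model = TinyRegressor(n_in=4, n_out=2)
x = torch.randn(8, 4)
mean, std = mc_dropout_predict(model, x)
print(mean.shape, std.shape)
```

A larger standard deviation for a given output indicates that the prediction is more sensitive to which units are dropped, which is commonly read as higher model uncertainty.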

To run imputation using all baselines, including DINI:

python3 impute.py --dataset <dataset> --strategy <strategy>
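The corruption and imputation commands above can be swept over every dataset and strategy combination. The sketch below only builds and prints the command lines; it assumes the scripts accept exactly the flags shown in this README and that corruption should precede imputation for each combination. The build_commands helper is hypothetical.

```python
import itertools
import shlex

DATASETS = ["breast", "diabetes", "diamonds", "energy", "flights", "yacht"]
STRATEGIES = ["MCAR", "MAR", "MNAR", "MSAR", "MPAR"]


def build_commands():
    """Return one corrupt.py and one impute.py command per combination."""
    commands = []
    for dataset, strategy in itertools.product(DATASETS, STRATEGIES):
        for script in ("corrupt.py", "impute.py"):
            commands.append(
                ["python3", script, "--dataset", dataset, "--strategy", strategy]
            )
    return commands


commands = build_commands()
for cmd in commands[:2]:
    print(shlex.join(cmd))
```

To actually run the sweep, each list can be passed to subprocess.run(cmd, check=True) from the repository root.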

To run surrogate modeling on imputed data for the three case studies:

python3 model.py --dataset <case_dataset> --strategy <strategy>

where <case_dataset> is one of gas, swat, or covid_cxr. Note that the swat dataset is not public and must be downloaded into the data/swat/ directory. To do this, get access to the dataset using this link. Then, save SWaT_Dataset_Attack_v0.csv to the data/swat/ directory.

Hacking DINI

To run any PyTorch model, you can modify the src/models.py file. See the examples in that file (namely the models FCN, FCN2, LSTM2, and TXF2). To use any dataset, convert it to a data.csv file placed in the data/<dataset> directory. Then, add the following lines to the process function in corrupt.py:

elif dataset == <dataset>:
	def split(df):
		return df.iloc[:, :-<out_col>].values, df.iloc[:, -<out_col>:].values

where <dataset> is the name of the dataset, and <out_col> is the number of output columns in the chosen dataset.
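As a concrete illustration of the split convention, the sketch below applies the same slicing to a small made-up table whose last two columns are outputs. The column names and values are invented for this example, and out_col is passed as a parameter here rather than hard-coded as in the snippet above.

```python
import pandas as pd


def split(df: pd.DataFrame, out_col: int):
    """Return (inputs, outputs), treating the last `out_col` columns as outputs."""
    return df.iloc[:, :-out_col].values, df.iloc[:, -out_col:].values


# Hypothetical dataset: three input features followed by two outputs.
df = pd.DataFrame(
    {
        "x1": [0.1, 0.2, 0.3],
        "x2": [1.0, 2.0, 3.0],
        "x3": [5, 6, 7],
        "y1": [0, 1, 0],
        "y2": [9.5, 8.5, 7.5],
    }
)

inputs, outputs = split(df, out_col=2)
print(inputs.shape, outputs.shape)  # (3, 3) (3, 2)
```

The slicing relies on the output columns being the last ones in data.csv, so reorder the columns before saving if that is not the case. Note that out_col must be at least 1, since a value of 0 would yield an empty input array.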

Developer

Shikhar Tuli. For any questions, comments or suggestions, please reach me at stuli@princeton.edu.

Cite this work

Cite our work using the following BibTeX entry:

@article{tuli2022sr,
      title={{DINI}: Data Imputation using Neural Inversion for Edge Applications}, 
      author={Tuli, Shikhar and Jha, Niraj K.},
      journal={Scientific Reports},
      volume={12},
      pages={20210},
      year={2022},
      publisher={Nature Publishing Group}
}

License

BSD-3-Clause. Copyright (c) 2021, Shikhar Tuli and JHA-Lab. All rights reserved.

See License file for more details.