GRIMP
GRIMP is a data imputation system that fills in missing values in dirty datasets by relying on Graph Neural Networks and attention.
main_multilabel.py is the script used to run the code.
main_corruption.py is used to generate datasets that contain missing values according to certain rules.
run.sh is a batch script used to run experiments on all the datasets listed in the script (a minimal sketch is shown below).
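The exact contents of run.sh depend on the experiments being run, but a batch run of this kind is essentially a loop over datasets. A minimal sketch with placeholder dataset names and paths (not the actual script) could look like this; see "Example configuration" below for the full set of flags:

#!/bin/bash
# Loop over a list of placeholder datasets and run the imputation script on each one.
for ds in dataset1 dataset2 dataset3; do
    python main_multilabel.py \
        --ground_truth data/clean/${ds}.csv \
        --dirty_dataset data/dirty/${ds}.csv \
        --save_imputed_df
done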
Installation
We strongly recommend running the code in a conda environment.
Additional packages
The required packages are listed in environment.yaml.
Create a new conda environment:
conda env create -f environment.yaml
Then activate the new environment:
conda activate grimp
Installing PyTorch
Install PyTorch by following the instructions for your platform, as explained on the official website.
If you have access to GPUs, GRIMP can make use of them to reduce execution time. Refer to the official documentation to install the GPU version of PyTorch.
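For example, inside the activated conda environment, a CPU-only build can be installed with pip as shown below; for a CUDA-enabled build, use the exact command generated by the selector on the PyTorch website for your CUDA version.

# CPU-only PyTorch install; replace with the CUDA-specific command from pytorch.org if you have GPUs
pip install torch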
Running the code
Example configuration
python main_multilabel.py
--ground_truth ground/truth/path.csv # path of the clean file
--dirty_dataset dirty/dataset/path # path of the dirty file
--gnn_feats 64 # number of features in the GNN
--h_feats 64 # number of hidden features in the classifier
--loss xe # cross-entropy loss
--epochs 150 # number of epochs to train the model for
--grace 400 # number of epochs guaranteed to train for before early stopping can trigger
--dropout_clf 0.2 # dropout in the classifier
--max_components 64 # dimensionality reduction of the pretrained embeddings
--text_emb pretrained/embeddings/path.emb # path to the pretrained embeddings
--head_model attention # use attention in the classifier heads
--shared_model linear # use simple linear modules in the shared layer
--learning_rate 0.001
--save_imputed_df # write the imputed dataset to results/imputed_datasets
--cat_columns cols_to_convert # list of numerical columns that should be treated as categorical (e.g. IDs, ZIP codes)
--fd_path fd/file/path.txt # path to the functional dependency (FD) file, if present
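When --save_imputed_df is set, the imputed dataset should appear under results/imputed_datasets (as noted in the flag description above). For example:

# Check the imputed datasets written by the run
ls results/imputed_datasets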
Preparing pretrained embeddings
The pretrained embeddings can be generated using the prepare_pretrained_embeddings.py script in the main folder.
This script expects the fastText Common Crawl pretrained embedding corpus, available at https://fasttext.cc/docs/en/english-vectors.html.
It will then list the files in the directory data/to_pretrain and generate embeddings for each of them, saving them in the directory data/pretrained-emb. The datasets in to_pretrain should be the exact same datasets that will later be used as "dirty datasets" in the training procedure.
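Putting it together, the embedding preparation workflow might look like the sketch below. The fastText archive name and the assumption that the script takes no command-line arguments are guesses, so check the fastText page and the script itself before running.

# 1. Download the fastText Common Crawl vectors (archive name is an assumption; see the fastText page for the exact file)
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip crawl-300d-2M.vec.zip

# 2. Copy the dirty datasets you plan to train on into data/to_pretrain (source path is a placeholder)
cp data/dirty/*.csv data/to_pretrain/

# 3. Generate the embeddings (assuming the script reads its inputs from the fixed paths above)
python prepare_pretrained_embeddings.py

# The resulting embeddings are written to data/pretrained-emb and can be passed to --text_emb
ls data/pretrained-emb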