GRIMP
GRIMP is a data imputation system that fills in missing values in dirty datasets by relying on Graph Neural Networks and attention.
main_multilabel.py is the script used to run the code.
main_corruption.py is used to generate datasets that contain missing values according to certain rules.
run.sh is a batch script used to run experiments on all the datasets listed in the script (a minimal sketch is shown below).
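The exact contents of run.sh depend on the experiments being run, but a batch run of this kind is essentially a loop over datasets. A minimal sketch with placeholder dataset names and paths (not the actual script) could look like this; see "Example configuration" below for the full set of flags:

#!/bin/bash
# Loop over a list of placeholder datasets and run the imputation script on each one.
for ds in dataset1 dataset2 dataset3; do
    python main_multilabel.py \
        --ground_truth data/clean/${ds}.csv \
        --dirty_dataset data/dirty/${ds}.csv \
        --save_imputed_df
done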
Installation
We strongly recommend running the code in a conda environment.
Additional packages
The required packages are listed in environment.yaml.
Create a new conda environment:
conda env create -f environment.yaml
Then activate the new environment:
conda activate grimp
Installing PyTorch
Install PyTorch by following the instructions for your platform, as explained on the official website.
If you have access to GPUs, GRIMP can make use of them to reduce execution time. Refer to the official documentation to install the GPU version of PyTorch.
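For example, inside the activated conda environment, a CPU-only build can be installed with pip as shown below; for a CUDA-enabled build, use the exact command generated by the selector on the PyTorch website for your CUDA version.

# CPU-only PyTorch install; replace with the CUDA-specific command from pytorch.org if you have GPUs
pip install torch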
Running the code
Example configuration
python main_multilabel.py
--ground_truth ground/truth/path.csv # path of the clean file
--dirty_dataset dirty/dataset/path # path of the dirty file
--gnn_feats 64 # number of features in the GNN
--h_feats 64 # number of hidden features in the classifier
--loss xe # cross-entropy loss
--epochs 150 # number of epochs to train the model for
--grace 400 # number of epochs guaranteed to train for before early stopping can trigger
--dropout_clf 0.2 # dropout in the classifier
--max_components 64 # dimensionality reduction of the pretrained embeddings
--text_emb pretrained/embeddings/path.emb # path to the pretrained embeddings
--head_model attention # use attention in the classifier heads
--shared_model linear # use simple linear modules in the shared layer
--learning_rate 0.001
--save_imputed_df # write the imputed dataset to results/imputed_datasets
--cat_columns cols_to_convert # list of numerical columns that should be treated as categorical (e.g. IDs, ZIP codes)
--fd_path fd/file/path.txt # path to the functional dependency (FD) file, if present
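When --save_imputed_df is set, the imputed dataset should appear under results/imputed_datasets (as noted in the flag description above). For example:

# Check the imputed datasets written by the run
ls results/imputed_datasets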
Preparing pretrained embeddings
The pretrained embeddings can be generated using the prepare_pretrained_embeddings.py script in the main folder.
This script expects the fastText Common Crawl pretrained embedding corpus, available at https://fasttext.cc/docs/en/english-vectors.html.
It will then list the files in the directory data/to_pretrain and generate embeddings for each of them, saving them in the directory data/pretrained-emb. The datasets in to_pretrain should be the exact same datasets that will later be used as "dirty datasets" in the training procedure.
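Putting it together, the embedding preparation workflow might look like the sketch below. The fastText archive name and the assumption that the script takes no command-line arguments are guesses, so check the fastText page and the script itself before running.

# 1. Download the fastText Common Crawl vectors (archive name is an assumption; see the fastText page for the exact file)
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip crawl-300d-2M.vec.zip

# 2. Copy the dirty datasets you plan to train on into data/to_pretrain (source path is a placeholder)
cp data/dirty/*.csv data/to_pretrain/

# 3. Generate the embeddings (assuming the script reads its inputs from the fixed paths above)
python prepare_pretrained_embeddings.py

# The resulting embeddings are written to data/pretrained-emb and can be passed to --text_emb
ls data/pretrained-emb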