A probabilistic database approach to autoencoder-based data cleaning

This is the source code for the paper "A probabilistic database approach to autoencoder-based data cleaning".

Installation

Simply install all the packages in requirements.txt (pip install -r requirements.txt). If using Linux, install.sh can be used to generate a venv.

This repository requires tensorflow 2.5 to be installed, as we use experimental features.

Instructions for usage

clean_one_file.py cleans a .csv file using the methods described in the paper. It takes 2 or 3 arguments: the file you want to clean, the name you want the cleaned file to have, and, optionally, the filename of a previously trained autoencoder (usually a .h5 file) to use for this cleaning. If the autoencoder filename is not specified, the program trains an autoencoder itself and saves it to the same folder as the cleaned .csv.

Example usage (delete the last argument if you want to train an autoencoder from scratch):

python3 clean_one_file.py input_data/surgical_case_durations.csv output_data/cleaned_db.csv "output_data/JSDu, SD=4    rows    10000/model.h5"
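
The two-or-three-argument interface described above could be parsed roughly as follows. This is a minimal sketch and not the code in clean_one_file.py: the function names, the argument names, and the default file name "model.h5" are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse two required paths and one optional autoencoder path (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Clean a .csv file with an autoencoder.")
    parser.add_argument("input_csv", type=Path, help="the .csv file to clean")
    parser.add_argument("output_csv", type=Path, help="where to write the cleaned .csv")
    # nargs="?" makes the third argument optional; it stays None when omitted.
    parser.add_argument("model_file", type=Path, nargs="?", default=None,
                        help="previously trained autoencoder (usually a .h5 file)")
    return parser.parse_args(argv)


def resolve_model_path(args):
    """Return (model_path, needs_training) for the parsed arguments."""
    if args.model_file is not None:
        # A trained autoencoder was supplied, so it can be reused as-is.
        return args.model_file, False
    # Assumption: a new model is saved next to the cleaned .csv as "model.h5".
    return args.output_csv.parent / "model.h5", True
```

With only two arguments, resolve_model_path reports that training is needed and points to the folder of the cleaned file; with three, it returns the supplied path unchanged.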

run_experiments.py contains the code used to generate the results. When started, it creates many different autoencoder and PDB combinations for data cleaning, continuously saving new results into the results folder. It also saves the dictionary containing the experiment configurations into the "output_data/experiments" file using dill. Setting the USE_GPU flag to True ensures that the GPU is used. We do not recommend this, as training was usually faster on the CPU.
figures/plot_results.ipynb can be used to generate the figures used in the paper. This requires the results folder to be populated with results first, and an "experiments" file (with the experiment configurations stored using dill) to exist in the output_data folder. Confidence intervals only appear once there are at least n=2 measurements per configuration.
results/merge_results.ipynb contains code to merge results stored in multiple .csv files (which happens when running experiments on multiple devices). Make sure they are stored in different folders such as "results_laptop" and "results_desktop", and change the list of suffixes in "mergelist" accordingly.
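
The merging step can be pictured with the sketch below. It is not the notebook's code: it assumes each device's folder is named "results" followed by a suffix, that every folder holds well-formed .csv files with a header row, and the function name merge_results and the added "source" column are hypothetical.

```python
import csv
from pathlib import Path


def merge_results(base_dir, suffixes, out_file):
    """Concatenate every .csv under results<suffix>/ into one file; return the row count."""
    base_dir = Path(base_dir)
    rows, fieldnames = [], []
    for suffix in suffixes:
        folder = base_dir / f"results{suffix}"
        for csv_path in sorted(folder.glob("*.csv")):
            with csv_path.open(newline="") as handle:
                reader = csv.DictReader(handle)
                # Collect the union of column names, preserving first-seen order.
                for name in reader.fieldnames or []:
                    if name not in fieldnames:
                        fieldnames.append(name)
                for row in reader:
                    row["source"] = folder.name  # remember which device produced the row
                    rows.append(row)
    with Path(out_file).open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames + ["source"],
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, merge_results(".", ["_laptop", "_desktop"], "merged.csv") would read the two folders named in the text and write a single combined file.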

Other files

The "src" folder contains methods used to generate the autoencoder, the Bayesian network of the underlying PDB, and many other helper methods.
The "output_data" folder contains the databases, ground truth PDB, noisy PDB, cleaned PDB and cleaned database of all the experiments that were executed since cloning this repository. It also contains all the trained autoencoders of each experiment as .h5 files.
The "databases_used_in_paper" folder contains a similar set of databases that we used for the tables in the paper.
The "figures" folder contains the .svg files of the figures shown in the paper.
The "input_data" folder contains data we used to generate some of the figures and tables in the paper. You can add your own data here, as long as it is a .csv file.
The "figures_wasserstein" folder includes exploratory results (mentioned in the Conclusion section) for future work: the Wasserstein distance as a loss function and as a performance metric. One of the more interesting figures is "figures_wasserstein/results_fig_sampling_density_plot_1.svg", which shows that using the Wasserstein distance as a loss function gives much better results at high sampling densities: performance (measured using the Wasserstein distance) decreases far less than with the JSD loss function. "figures_wasserstein/results_fig_sampling_density_plot_2.svg" shows that the MSE reduction with the Wasserstein loss function is almost 100% at almost any sampling density, much higher than with the JSD loss function.
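
To make the two measures mentioned above concrete, the sketch below computes the JSD and the 1-Wasserstein distance for small discrete distributions. It does not reproduce the experiments or the loss functions in "src"; it assumes the attribute values are ordered and evenly spaced (0, 1, 2, ...), and the function names are hypothetical.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with zero probability in a contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """1-Wasserstein distance for distributions on the points 0, 1, 2, ..."""
    total, cdf_gap = 0.0, 0.0
    for x, y in zip(p, q):
        cdf_gap += x - y       # running difference between the two CDFs
        total += abs(cdf_gap)  # unit spacing, so each gap has weight 1
    return total


p = [1.0, 0.0, 0.0, 0.0]
q_near = [0.0, 1.0, 0.0, 0.0]  # all mass moved one step
q_far = [0.0, 0.0, 0.0, 1.0]   # all mass moved three steps
print(jsd(p, q_near), jsd(p, q_far))                        # 1.0 1.0
print(wasserstein_1d(p, q_near), wasserstein_1d(p, q_far))  # 1.0 3.0
```

The JSD is the same for both shifted distributions because their supports do not overlap with p, while the Wasserstein distance grows with how far the probability mass has moved.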

fpjnijweide/autoencoder-pdb-cleaning
