This is a modified version of VAMBN from GitHub. Its purpose is applying the method to the DONALD study data.
For educational and testing purposes, a small dummy data file is also included: data/fruit-data.csv
This repo includes modified scripts for both the fruit and the DONALD data.
There are also scripts for plotting and other useful evaluations of the data available: evaluation_scripts/
Last used with:
- tensorflow 1.14.0
- Keras 2.3.1 (only required for LSTM branches)
To start, a suitable data file is required. The fruit data CSV is included, but the private DONALD data can not be part of this repository. To generate the usable DONALD data CSV from a DONALD SAS file, use donald_loader.py
To run the pipeline, execute the following files, where ??? is your dataset (fruit
or donald
).
clean_format_???.R
- Defines variable groups and standalone variables
impute_aux_???.R
- Reformats data and finds missingness
HI-VAE/1_???_HIVAE_training.py
(make sure to set sample_size)- Trains a HI-VAE for every module (16 visits * 4 variable groups = 64 modules for DONALD)
- Reconstructs participants from the original participant encodings (real participants, RPs)
bnet_???.R
- Learns the BNet structure and parameters
- Draws virtual participant codes from the BNet
- For DONALD, it is suggested to use
make_bl_wl_donald.py
instead of the equivalent R script. To do this, runbnet_donald.R
untildata_names.csv
is written, stop it and run the BL/WL script, then rerun the BNet script. - If the Tabu search produces errors, it is suggested to also try switching from
tabu
to thehc
algorithm (for both "Final bayesian network" and "Bootstrapped network")
HI-VAE/2_???_HIVAE_VP_decoding_and_counterfactuals.py
(set sample_size)- Reconstructs drawn encodings (virtual participants, VPs)
To reformat the RPs or VPs into the original DONALD format, use donald_writer.py
If you use the code, pease cite:
Paper link
@misc{kühnel2023synthetic,
title={Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results},
author={Lisa Kühnel and Julian Schneider and Ines Perrar and Tim Adams and Fabian Prasser and Ute Nöthlings and Holger Fröhlich and Juliane Fluck},
year={2023},
eprint={2305.07685},
archivePrefix={arXiv},
primaryClass={stat.ME}
}