This repository contains scripts to replicate the findings of the paper Gomari, Schweickart et al., "Variational autoencoders learn universal latent representations of metabolomics data".
Clone a local copy of the git repository:
git clone https://github.com/krumsieklab/mtVAE
(or use a git GUI client of your choice)
In a terminal, switch to the directory of the local git repository and create the conda environment:
conda env create --force --file environment.yml
- Open mtvae.Rproj
- Run R_setup.R
Note: this setup has been tested with R version 4.0.0 and RStudio version 1.3.1073.
conda activate mtvae_env
jupyter notebook
- Place datasets into data/.
- Once you have access to the TwinsUK, type 2 diabetes (T2D), schizophrenia, and acute myeloid leukemia (AML) data, run the scripts in increasing order of their file prefixes, starting from 01_train_VAE.ipynb.
- Notes:
  - Pre-trained models from 01_train_VAE.ipynb can be found under models/.
  - All R scripts should be run from within RStudio.
  - 00_optimize_VAE_hyperparameters.ipynb can be skipped and should only be used as a guide to select hyperparameters.
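
The run order described above can be derived from the file names themselves. The following is a minimal sketch, not part of the repository: the function name `ordered_scripts` and the `skip_optional` flag are hypothetical, and it assumes that every script with a `00_` prefix is optional, which the notes state explicitly only for the VAE hyperparameter notebook.

```python
import re
from pathlib import PurePath

# Matches a numeric prefix such as "01_" at the start of a file name.
PREFIX_PATTERN = re.compile(r"^(\d+)_")


def ordered_scripts(file_names, skip_optional=True):
    """Return numbered scripts sorted by their numeric file prefix.

    Hypothetical helper: files without a numeric prefix are ignored.
    With skip_optional=True, scripts prefixed 00_ are dropped
    (assumption: all 00_ scripts are optional).
    """
    selected = []
    for name in file_names:
        match = PREFIX_PATTERN.match(PurePath(name).name)
        if match is None:
            continue  # not a numbered pipeline script, e.g. models.py
        prefix = int(match.group(1))
        if skip_optional and prefix == 0:
            continue
        selected.append((prefix, name))
    # Sort by prefix, then by name, so scripts sharing a prefix
    # (such as the three 04_ notebooks) keep a stable order.
    return [name for _, name in sorted(selected)]


if __name__ == "__main__":
    example = [
        "05_interpret_latent_space.R",
        "00_optimize_VAE_hyperparameters.ipynb",
        "01_train_VAE.ipynb",
        "03_assess_model_performance.R",
        "02_reconstruct_data.ipynb",
        "models.py",
    ]
    for script in ordered_scripts(example):
        print(script)
```

The helper only lists the scripts; the notebooks still need to be run in Jupyter and the R scripts from within RStudio.
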
Name | Description |
---|---|
00_optimize_KPCA_hyperparameters.ipynb | Optimize KPCA hyperparameters using TwinsUK training data. (Runtime: ~2h on a MacBook Pro) |
00_optimize_VAE_hyperparameters.ipynb | Optimize VAE hyperparameters using TwinsUK training data. (Runtime: ~1h15m on a MacBook Pro) |
01_train_VAE.ipynb | Train VAE model on TwinsUK data and calculate evaluation metrics. Note that this requires access to TwinsUK, which should be requested separately from https://twinsuk.ac.uk/. |
02_reconstruct_data.ipynb | Generate TwinsUK data reconstructions using trained VAE, PCA, and KPCA models. Used for model performance assessments. |
03_assess_model_performance.R | Compute mean squared error (MSE) and correlation matrix MSE (CM-MSE) for VAE, PCA, and KPCA. This includes the calculation of MSE and CM-MSE for varying latent space dimensionality d. |
04_calculate_SAGE_values_VAE.ipynb | Calculate VAE SAGE values using TwinsUK test data. This script should be parallelized due to its long runtime. Pre-computed VAE SAGE values can be found under results/sage_values. (Runtime: ~7.5h if all instances are parallelized) |
04_calculate_SAGE_values_PCA.ipynb | Calculate PCA SAGE values using TwinsUK test data. Pre-computed PCA SAGE values can be found under results/sage_values. (Runtime: ~1.5h if all instances are parallelized) |
04_calculate_SAGE_values_KPCA.ipynb | Calculate KPCA SAGE values using TwinsUK test data. Pre-computed KPCA SAGE values can be found under results/sage_values. (Runtime: ~6h if all instances are parallelized) |
05_interpret_latent_space.R | Create SAGE value heatmaps and alluvial plots for VAE, PCA, and KPCA. |
06_encode_data.ipynb | Generate type 2 diabetes (T2D), schizophrenia, and acute myeloid leukemia (AML) data encodings using VAE, PCA, and KPCA models. |
07_associate_dimensions_with_diseases.R | Associate VAE, PCA, and KPCA encodings with patient groups from T2D, schizophrenia, and AML data. This includes T2D clinical variables (e.g. HbA1c %) and AML mutations. |
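
To illustrate the two metrics named for 03_assess_model_performance.R, here is a minimal sketch in plain Python. It assumes that CM-MSE is the MSE between the feature correlation matrix of the original data and that of the reconstructed data, and that data are arranged as samples by features. The function names are hypothetical, and the repository's own R implementation may differ.

```python
import math


def mse(original, reconstructed):
    """Mean squared error over all entries of two equally shaped matrices."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(original, reconstructed):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def correlation_matrix(data):
    """Pearson correlations between the columns of a samples x features matrix.

    Assumes no feature is constant; a constant column would divide by zero.
    """
    columns = list(zip(*data))
    n_samples = len(data)
    means = [sum(col) / n_samples for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    return [
        [
            sum(x * y for x, y in zip(col_i, col_j)) / (norm_i * norm_j)
            for col_j, norm_j in zip(centered, norms)
        ]
        for col_i, norm_i in zip(centered, norms)
    ]


def cm_mse(original, reconstructed):
    """MSE between the correlation matrices of original and reconstructed data."""
    return mse(correlation_matrix(original), correlation_matrix(reconstructed))


if __name__ == "__main__":
    original = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]]
    reconstructed = [[1.1, 2.2], [1.9, 4.0], [3.2, 5.9], [3.8, 7.6]]
    print("MSE:", mse(original, reconstructed))
    print("CM-MSE:", cm_mse(original, reconstructed))
```

MSE measures how closely individual values are reconstructed, while CM-MSE under this assumption measures how well the correlation structure between features is preserved, so a model can score well on one and poorly on the other.
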
Name | Description |
---|---|
models.py | Contains VAE, PCA, and KPCA model classes. |
metric_functions.py | Contains Python functions used for model assessment. |
helper_functions.R | Contains R functions required to calculate evaluation results and construct plots. |