mtVAE

This repository contains the scripts needed to replicate the findings of the paper by Gomari, Schweickart et al., "Variational autoencoders learn universal latent representations of metabolomics data".

Requirements:

1. git repository

Clone a local copy of the git repository:

git clone https://github.com/krumsieklab/mtVAE

(or use a git GUI client of your choice)

2. Environment setup

Set up the Python environment

In a terminal, switch to the directory of the local git repository and create the conda environment:

conda env create --force --file environment.yml

R setup

Open mtvae.Rproj
Run R_setup.R

Note: this setup has been tested under R version 4.0.0 and RStudio version 1.3.1073.

3. Activate the conda environment to access the Jupyter notebooks

conda activate mtvae_env

jupyter notebook
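Before opening the notebooks, it can help to confirm that the environment activated above is the one Python actually sees. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the environment was activated by name, in which case conda exposes that name through the `CONDA_DEFAULT_ENV` environment variable.

```python
import os

# Environment name taken from the `conda activate` command above.
EXPECTED_ENV = "mtvae_env"


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if unset."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def check_env(expected=EXPECTED_ENV, environ=None):
    """Return a short status message about the active conda environment."""
    active = active_conda_env(environ)
    if active == expected:
        return f"OK: conda environment '{expected}' is active."
    return (
        f"Expected '{expected}' but found {active!r}; "
        f"run 'conda activate {expected}' first."
    )


# Report the status for the current process.
print(check_env())
```

Running this in the first cell of a notebook is one way to catch a kernel that was started from the wrong environment.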

4. Instructions for running the scripts

Place the datasets into data/.
Once you have access to the TwinsUK, type 2 diabetes (T2D), schizophrenia, and acute myeloid leukemia (AML) data, run the scripts in increasing order of their file prefixes, starting from 01_train_VAE.ipynb.

Notes:
- Pre-trained models from 01_train_VAE.ipynb can be found under models/
- All R scripts should be run from within RStudio
- 00_optimize_VAE_hyperparameters.ipynb can be skipped and should only be used as a guide to select hyperparameters.
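The run order described above can be listed programmatically. This is a small sketch rather than a repository utility: `order_by_prefix` and `list_scripts` are hypothetical helper names, and it assumes every pipeline script starts with a numeric prefix followed by an underscore and ends in `.ipynb` or `.R`. Scripts that share a prefix are ordered alphabetically, which is an arbitrary choice made here, not one stated by the repository.

```python
import re
from pathlib import Path

# Numeric prefix, underscore, anything, then a notebook or R script extension.
PREFIX = re.compile(r"^(\d+)_.*\.(?:ipynb|R)$")


def order_by_prefix(names):
    """Return numbered script names sorted by numeric prefix, then by name."""
    numbered = []
    for name in names:
        match = PREFIX.match(name)
        if match:  # skip files such as models.py that carry no prefix
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def list_scripts(repo_dir="."):
    """List the numbered scripts found directly inside repo_dir, in run order."""
    return order_by_prefix(p.name for p in Path(repo_dir).iterdir() if p.is_file())
```

Calling `list_scripts()` from the repository root returns the notebooks and R scripts in the order they are meant to be run; the 00_ scripts appear first and can be skipped as noted above.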

Name	Description
00_optimize_KPCA_hyperparameters.ipynb	Optimize KPCA hyperparameters using TwinsUK train data. (Runtime: ~2h on a MacBook Pro)
00_optimize_VAE_hyperparameters.ipynb	Optimize VAE hyperparameters using TwinsUK train data. (Runtime: 1h15m on a MacBook Pro)
01_train_VAE.ipynb	Train VAE model on TwinsUK data and calculate evaluation metrics. Note that this requires access to TwinsUK, which should be requested separately from https://twinsuk.ac.uk/.
02_reconstruct_data.ipynb	Generate TwinsUK data reconstructions using trained VAE, PCA, and KPCA models. Used for model performance assessments.
03_assess_model_performance.R	Compute mean squared error (MSE) and correlation matrix MSE (CM-MSE) for VAE, PCA, and KPCA. This includes the calculation of MSE and CM-MSE for varying latent space dimensionality d.
04_calculate_SAGE_values_VAE.ipynb	Calculate VAE SAGE values using TwinsUK test data. This script should be parallelized because of its long runtime. Pre-computed VAE SAGE values can be found under results/sage_values. (Runtime: ~7.5h if all instances are parallelized)
04_calculate_SAGE_values_PCA.ipynb	Calculate PCA SAGE values using TwinsUK test data. Pre-computed PCA SAGE values can be found under results/sage_values. (Runtime: ~1.5h if all instances are parallelized)
04_calculate_SAGE_values_KPCA.ipynb	Calculate KPCA SAGE values using TwinsUK test data. Pre-computed KPCA SAGE values can be found under results/sage_values. (Runtime: ~6h if all instances are parallelized)
05_interpret_latent_space.R	Create SAGE value heatmaps and alluvial plots for VAE, PCA, and KPCA.
06_encode_data.ipynb	Generate type 2 diabetes (T2D), schizophrenia, and acute myeloid leukemia (AML) data encodings using VAE, PCA, and KPCA models.
07_associate_dimensions_with_diseases.R	Associate VAE, PCA, and KPCA encodings with patient groups from T2D, schizophrenia, and AML data. This includes T2D clinical variables (e.g. HbA1c %) and AML mutations.
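To make the two metrics computed in 03_assess_model_performance.R concrete, here is a dependency-free sketch. It assumes that CM-MSE is the MSE between the feature-by-feature Pearson correlation matrix of the original data and that of the reconstructed data; the authoritative definitions are the ones in the repository's R scripts. All function names below are hypothetical, and data are represented as lists of rows (samples) with one column per metabolite.

```python
from math import sqrt


def mse(a, b):
    """Mean squared error over all entries of two equally shaped matrices."""
    squared = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(squared) / len(squared)


def pearson(u, v):
    """Pearson correlation of two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    norm_u = sqrt(sum((x - mean_u) ** 2 for x in u))
    norm_v = sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (norm_u * norm_v)


def correlation_matrix(data):
    """Feature-by-feature correlation matrix of a samples-by-features table."""
    columns = list(zip(*data))  # transpose: one tuple per feature
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


def cm_mse(original, reconstructed):
    """MSE between the correlation matrices of original and reconstructed data."""
    return mse(correlation_matrix(original), correlation_matrix(reconstructed))


# Toy data: three samples, two features (values are illustrative only).
original = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0]]
reconstructed = [[1.1, 2.2], [1.9, 4.1], [3.2, 6.5]]
print("MSE:", mse(original, reconstructed))
print("CM-MSE:", cm_mse(original, reconstructed))
```

The two metrics capture different things: shifting every reconstructed value by a constant increases the MSE but leaves the CM-MSE at zero, because the correlations between features are unchanged.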

Other files

Name	Description
models.py	Contains VAE, PCA, and KPCA model classes.
metric_functions.py	Contains the Python functions used for model assessment.
helper_functions.R	Contains the R functions required to calculate evaluation results and construct plots.