This package provides a toolbox for building generative and predictive probabilistic models of biological sequences, as described in Weinstein and Marks (2021). It is implemented in the probabilistic programming language Edward2, with a TensorFlow backend. Note that an implementation of the MuE in Pyro is also available as part of the Pyro package; see the Pyro docs and examples.
To install the package, create a new python 3.7 virtual environment (e.g. using conda) and run:

```
pip install "git+https://github.com/debbiemarkslab/MuE.git#egg=MuE[extras]" --use-feature=2020-resolver
pip install "git+https://github.com/debbiemarkslab/edward2.git#egg=edward2"
```
This shouldn't take more than a few minutes. You can find out more about the package requirements in the `setup.py` file. We've provided a fork of the Edward2 repo since it does not yet have a stable release, but you can also install the newest version from the project repo https://github.com/google/edward2 to get the latest features.
To run the models at large scale, you will need access to a GPU with CUDA installed. See https://www.tensorflow.org/install/gpu for further support.
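As a quick sanity check that TensorFlow can see your GPU, you can run a snippet like the following (uses `tf.config.list_physical_devices`, available in TensorFlow 2.1+; earlier versions expose the same call under `tf.config.experimental`):

```python
import tensorflow as tf

# List the GPUs visible to TensorFlow; an empty list means
# training will fall back to the CPU.
print(tf.config.list_physical_devices('GPU'))
```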
To run the example models, first clone this MuE repo and navigate to the `models` directory.
Each model (`FactorMuE.py` and `RegressMuE.py`) is configured with a separate config file, such as `examples/factor_config.cfg` (for `FactorMuE.py`) and `examples/regress_config.cfg` (for `RegressMuE.py`). The config file controls the dataset, hyperparameters, training time, etc. Descriptions of all the options can be found in the config files themselves.
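For orientation, the config files are standard INI-style files organized into sections; a rough skeleton (option names omitted here; the example configs list them all, and the comments below just paraphrase this README) looks like:

```
[data]
# dataset options (which sequences to train on, etc.)
[hyperp]
# model hyperparameters (suggested defaults are given below)
[train]
# training options such as batch_size, learning_rate and max_epochs
```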
The output of each run, including plots of point estimates of key parameters, is currently configured to be saved in the subdirectory `examples/logs`.
You can inspect the ELBO optimization curve by running `tensorboard --logdir=./examples/logs` from the `models` directory, which will launch a TensorBoard instance (https://www.tensorflow.org/tensorboard).
Once the models are trained, there are scripts for visualizing the results (`visualize_FactorMuE_results.py` and `visualize_RegressMuE_results.py`), which provide, in particular, the "shift in preference" plots shown in the paper.
To run the example FactorMuE model on the example dataset `examples/data1.fasta`, run the following from the `models` directory:

```
python FactorMuE.py examples/factor_config.cfg
```
The model should take roughly 1-2 minutes to finish the training run (100 epochs). In `examples/logs` you should find a new folder, named with a timestamp corresponding to when the run began (e.g. 20201003-144700 for Oct. 3, 2020, 2:47 pm). This contains:

- A summary of the run in `config.cfg`, which copies the input config file but also adds scalar results such as the heldout perplexity. Under the `[results]` heading, the entry `heldout_perplex` should be roughly 2.0 in this example.
- A TensorFlow log file (the filename starts with `events.out`), which can be read by TensorBoard.
- Three plots, showing the mean and variance of the posterior latent representation (`z.pdf`; should show four points very close together), as well as the posterior mean mutation matrix (`l.pdf`) and the posterior mean deletion and insertion parameters (`ur.pdf`; all the points should be very close to zero in this example).
- The detailed results in `results.dill`. This file is saved using the `dill` package (https://dill.readthedocs.io/en/latest/dill.html). An illustration of how to load and analyze these results can be found in `visualize_FactorMuE_results.py`; a minimal loading sketch is also shown below.
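As a minimal sketch of loading the saved results (the structure of the stored object is model-specific, so inspect it before use; the timestamped path below is just an example):

```python
import dill

# Load the detailed results from a particular run.
with open('examples/logs/20201003-144700/results.dill', 'rb') as f:
    results = dill.load(f)

# The contents are model-specific; inspect before relying on them.
print(type(results))
```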
To visualize the results of the model, we often want to project vectors from the latent space onto a reference sequence (this is described in mathematical detail in the article). The script `visualize_FactorMuE_results.py` provides a tool for accomplishing this. Run the following command, replacing "####" with the timestamp of your output folder:

```
python visualize_FactorMuE_results.py examples/logs/####/config.cfg --z_plot=True --proj_shift=True --z_tail=[-0.15] --z_head=[-0.05] --seq_ref=examples/data1.fasta
```
`z_tail` sets the position of the start of the latent space vector and `z_head` sets the position of its end. The first entry in the fasta file passed to `seq_ref` controls the reference sequence (in this case, ATAT). The output will be in a new time-stamped folder within the model output folder `examples/logs/####`. You'll find plots of:
- The projected tail and head of the vector, represented as logos (`aligned_tail_logo.pdf` and `aligned_head_logo.pdf`), and the difference between them (`aligned_shift_logo.pdf`).
- The magnitude of the shift in amino acid preference across the reference sequence (`aligned_shift_magnitude.pdf`), which is the nu vector described in the supplementary material of the paper. In this small example dataset, there isn't enough evidence for the model to find much change across the latent space, so the magnitude should be small (less than 0.1).
The `RegressMuE.py` model works the same way. You can run it on the example dataset (with the covariate file `data1_covariate.csv`):

```
python RegressMuE.py examples/regress_config.cfg
```
Then visualize the shift in nucleotide preference with changing covariates (replace "####" with the timestamp of your output folder):

```
python visualize_RegressMuE_results.py examples/logs/####/config.cfg --proj_shift=True --z_tail=[1,0] --z_head=[0,1] --covar_ref=examples/data1_covariate.csv --seq_ref=examples/data1.fasta
```
The resulting plot `aligned_shift_logo` should show the nucleotide T gaining probability at positions 1 and 3 of the logo, while the nucleotide C drops in probability by about the same amount. This reveals that sequences with covariate [0,1] are more likely to have a T, and less likely to have a C, at these positions than sequences with covariate [1,0].
To run the FactorMuE on the example VE6 protein dataset analyzed in the paper, with a Laplace instead of a Normal prior on the latent representation, you can run:

```
python FactorMuE.py examples/ve6_factor_config.cfg
```

This will take a few hours on a GPU. You should find a heldout perplexity around 4.3 and a complex latent embedding space with multiple distinguishable subpopulations.
To run the FactorMuE or RegressMuE on your own data:

- Make a new version of the config file, and edit the `[data]` section.
- Adjust the hyperparameters in the `[hyperp]` section. We suggest as defaults:

  ```
  [hyperp]
  bt_scale = 1.0
  b0_scale = 1.0
  u_conc = [100,1]
  r_conc = [100,1]
  l_conc = 2.0
  latent_dims = 2
  latent_length = auto
  latent_alphabet_size = 25
  z_distr = Normal

  [train]
  batch_size = 5
  learning_rate = 0.01
  ```

- Set `max_epochs` to a small value (e.g. 5 or 10) to start.
- Check that you have access to a GPU and that TensorFlow detects the GPU when the code starts running (this will show up in the initial output).
- Monitor the training curve using TensorBoard.
- Be patient. It can take 1-2 hours to train each model on a dataset of 1000 sequences. Future versions of the `MuE` package will be faster.
- Run multiple initializations of the model with different seeds; we recommend at least 3. You can choose among them based on the final ELBO, which approximates the marginal log likelihood, or the heldout log likelihood.
- If you start getting NaNs in the estimated ELBO, lower the learning rate and/or increase the batch size. Increasing `l_conc` can also help in these particular models.
- If you don't find much structure in the FactorMuE latent space, you can increase `anneal_epochs` to be much larger than `max_epochs` to force the model into the auto-encoding limit (see the sketch after this list).
- For this Edward2 implementation, GPUs typically run out of memory when working with sequences longer than about 275 amino acids. Longer sequences must currently be run on the CPU.
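As a concrete illustration of the `anneal_epochs` tip above, a `[train]` section pushing the model toward the auto-encoding limit might look like the sketch below; whether `anneal_epochs` belongs in `[train]`, and the specific values, are assumptions here, so copy the exact layout from the example config files:

```
[train]
batch_size = 5
learning_rate = 0.01
max_epochs = 100
# Setting anneal_epochs far above max_epochs keeps the annealing
# schedule from completing, holding the model near the auto-encoding
# limit (values are illustrative, not recommendations).
anneal_epochs = 1000
```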
Writing new MuE observation models requires familiarity with building and training probabilistic models in Edward2. There are four key utility functions in the `mue` package:

- `make_transfer(M, dtype=tf.float64)` takes in the length of the ancestral sequence M and provides a set of constant transfer matrices (`transfer_mats`) that are used to structure the MuE distribution. It should be run once, before the training loop.
- `make_hmm_params(vxln, vcln, uln, rln, lln, transfer_mats, dtype=tf.float64)` takes in the log of the MuE parameters, along with the transfer matrices returned by `make_transfer`, and returns the parameters of a hidden Markov model (HMM). These parameters can be fed to the HMM distribution in TensorFlow Probability.
- `hmm_log_prob(distr, x, xlen)` takes in a TensorFlow Probability HMM distribution and returns the log probability of generating a sequence `x` of length `xlen` (the sequence is represented as a one-hot, padded encoding). It's an alternative to the built-in TensorFlow Probability HMM `log_prob` function, which was until very recently incorrect.
- `encode` offers a utility for building amortized inference networks for MuE observation models with local latent variables; an example of its usage is in the `FactorMuE.py` script.

A sketch of how these pieces fit together is shown below.
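As a rough sketch of how the first three utilities might be combined (the return convention assumed for `make_hmm_params` is (initial, transition, emission) log probabilities; that convention, the `import mue` path, and the parameter shapes are all assumptions, so check the `mue` source and `FactorMuE.py` for the definitive pattern):

```python
import tensorflow as tf
import tensorflow_probability as tfp
import mue  # module layout is an assumption; the functions may live in a submodule

tfd = tfp.distributions

# Built once, before the training loop, for ancestral sequence length M.
M = 10
transfer_mats = mue.make_transfer(M, dtype=tf.float64)

def mue_log_prob(vxln, vcln, uln, rln, lln, x, xlen):
    """Score a one-hot, padded sequence x of true length xlen under a MuE
    with the given log parameters. Sketch only: assumes make_hmm_params
    returns (initial, transition, emission) log probabilities."""
    a0, a, e = mue.make_hmm_params(vxln, vcln, uln, rln, lln,
                                   transfer_mats, dtype=tf.float64)
    distr = tfd.HiddenMarkovModel(
        initial_distribution=tfd.Categorical(logits=a0),
        transition_distribution=tfd.Categorical(logits=a),
        observation_distribution=tfd.OneHotCategorical(logits=e),
        num_steps=x.shape[0])  # padded sequence length
    return mue.hmm_log_prob(distr, x, xlen)
```

In `FactorMuE.py` and `RegressMuE.py` the log parameters themselves are produced by the generative model, and `encode` plays the complementary role on the amortized inference side.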
To run the unit tests, navigate to the `tests` directory of the repo and run:

```
pytest
```

All tests should pass.