Learning protein fitness models from evolutionary and assay-labelled data

This repo is a collection of code and scripts for evaluating methods that combine evolutionary and assay-labelled data for protein fitness prediction.

For more details, please see our pre-print, *Combining evolutionary and assay-labelled data for protein fitness prediction*.

Contents

  • Repo contents
  • System requirements
  • Installation
  • Demo
  • Jackhmmer search
  • Fitness data
  • Density models
  • Predictors

Repo contents

There are several main components of the repo.

  • data: Processed protein fitness data. (Only one example data set is provided here due to GitHub repo size constraints. Please download all data sets from Dryad doi:10.6078/D1K71B.)
  • alignments: Processed multiple sequence alignments. (Only one example alignment is provided here due to GitHub repo size constraints. Please download all alignments from Dryad doi:10.6078/D1K71B.)
  • scripts: Bash and Python scripts for data collection and data analysis.
  • src: Python code for training and evaluating the methods assessed in the paper, including the framework for evaluating and comparing different predictors.
  • environment.yml: Software dependencies for conda environment.

When running the provided scripts, the outputs will be written to the following directories:

  • inference: Directory for intermediate files such as inferred sequence log likelihoods.
  • results: Directory for results as csv files.
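For example, after running the demo below on the BLAT_ECOLX_Ranganathan2015-2500 data set, the layout will look roughly like this (illustrative; the exact files depend on which methods are run):

    inference/
      BLAT_ECOLX_Ranganathan2015-2500/
        plmc/           # couplings model files from scripts/plmc.sh
    results/
      BLAT_ECOLX_Ranganathan2015-2500/
        results.csv     # evaluation results across runs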

System requirements

Hardware requirements

Some of the methods, in particular the DeepSequence VAE, UniRep mLSTM, and ESM Transformer, require a GPU for training and inference. The GPU code in this repo has been tested on an NVIDIA Quadro RTX 8000.

Evaluating all the methods, each with 20 random seeds, 19 data sets, and 10 training setups, would take a long time on a single core. Our evaluation code supports multiprocessing and has been tested on 32 cores.

Storing all intermediate files for all methods and all data sets requires approximately 100 GB of disk space.

Software requirements

The code has been tested on Ubuntu 18.04.5 LTS (Bionic Beaver) with conda 4.10.0 and Python 3.8.5. The (optional) slurm scripts have been tested on slurm 17.11.12. The list of software dependencies is provided in the environment.yml file.

Installation

  1. Create the conda environment from the environment.yml file:
    conda env create -f environment.yml
  2. Activate the new conda environment:
    conda activate protein_fitness_prediction
  3. Install the plmc package (the commands below install under $HOME; to install plmc in a different directory, modify scripts/plmc.sh accordingly):
    cd $HOME
    git clone https://github.com/debbiemarkslab/plmc.git
    cd plmc
    make all-openmp
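To verify the build, you can invoke the compiled binary (the plmc Makefile places it in the bin subdirectory); running it without arguments should print a usage message:

    $HOME/plmc/bin/plmc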

The installation should finish in a few minutes.

Demo

The one-hot linear model is the simplest example, as it only requires assay-labelled data. To evaluate the one-hot linear model on the β-lactamase (BLAT_ECOLX) data with 240 training examples and 20 seeds on a single core:

    python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 onehot --n_seeds=20 --n_threads=1 --n_train=240

When the program finishes, the results from the 20 runs will be available in the file results/BLAT_ECOLX_Ranganathan2015-2500/results.csv.
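To inspect these results programmatically, here is a minimal sketch using pandas. The metric column name below is an assumption; check the csv header for the actual names:

    import pandas as pd

    # Load the per-run results written by evaluate.py.
    df = pd.read_csv("results/BLAT_ECOLX_Ranganathan2015-2500/results.csv")
    print(df.columns)  # see which metric columns are available

    # Hypothetical: if a per-run 'spearman' column exists, summarize it
    # across the 20 seeds.
    # print(df["spearman"].mean())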

As another example that involves both evolutionary and assay-labelled data, here we show how to evaluate the augmented Potts model on the same protein.

The multiple sequence alignments (MSAs) are available in the alignments directory for all proteins used in our assessment. For other proteins, MSAs can be retrieved by jackhmmer search (see the jackhmmer search section).

From the MSA, first run PLMC to estimate the couplings model:

    bash scripts/plmc.sh BLAT_ECOLX BLAT_ECOLX_Ranganathan2015-2500 

The resulting models are saved at inference/BLAT_ECOLX_Ranganathan2015-2500/plmc.

Then, similar to the one-hot linear model evaluation, run:

    python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 ev+onehot --n_seeds=20 --n_threads=1 --n_train=240

The evaluation should finish in a few minutes, and all results will be saved to results/BLAT_ECOLX_Ranganathan2015-2500/results.csv.

Here, ev+onehot refers to the augmented Potts model. Other models and data sets can be evaluated similarly, as long as the corresponding prerequisite files are present in the inference directory.
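For example, assuming the couplings model produced by scripts/plmc.sh above is in place, the evolutionary-only Potts predictor (ev, see the Density models section) can be evaluated with the same interface:

    python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 ev --n_seeds=20 --n_threads=1 --n_train=240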

Jackhmmer search

  1. Download UniRef100 in fasta format from UniProt.
  2. Index the UniRef100 fasta file (creating an .ssi index) with
    esl-sfetch --index <seqdb>.fasta
  3. Set the file location of the fasta file in scripts/jackhmmer.sh.
  4. To run jackhmmer, use scripts/jackhmmer.sh to search the local fasta file. In addition to running the jackhmmer search, the script also implicitly calls the other file conversion scripts: it extracts target ids from the jackhmmer tabular output with scripts/tblout2ids.py, converts the fasta output to a list of sequences with scripts/fasta2txt.py, and splits the sequences into train and validation sets with scripts/randsplit.py.
  5. The outputs of the jackhmmer script will be in jackhmmer/<dataset>/<run_id>, where each iteration's alignment is saved as iter-<N>.a2m and the final alignment is saved as alignment.a2m. The list of full-length target sequences is at target_seqs.fasta and target_seqs.txt (see the illustrative layout below).
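Concretely, the output directory should look roughly like this (illustrative layout based on the file names above):

    jackhmmer/<dataset>/<run_id>/
      iter-1.a2m          # alignment after the first jackhmmer iteration
      iter-2.a2m
      ...
      alignment.a2m       # final alignment
      target_seqs.fasta   # full-length target sequences (fasta)
      target_seqs.txt     # full-length target sequences (plain text list)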

Fitness data

In the example data set in the data directory (and also for all other data sets available on Dryad), each subdirectory (e.g. data/BLAT_ECOLX_Ranganathan2015-2500) represents a data set of interest. In the subdirectory, there are two key files.

  • wt.fasta documents the WT sequence.
  • data.csv contains three columns: seq, log_fitness, n_mut. seq is the mutated sequence, which should be the same length as the WT sequence. log_fitness is the log enrichment ratio or another log-scale fitness value, where higher is better. Although referred to as log_fitness here, this corresponds to fitness in the paper. n_mut is the number of mutations separating the sequence from WT, where 0 indicates WT. (An illustrative example follows this list.)
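For illustration, a minimal hypothetical data.csv might look like this, with made-up toy sequences and fitness values; the first row is the WT sequence (n_mut = 0):

    seq,log_fitness,n_mut
    MKTAYIAKQR,0.00,0
    MKTAYIAKQK,-0.35,1
    MKTAYVAKQK,-1.20,2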

Density models

Potts model

For learning a Potts model (EVmutation / plmc) from an MSA, see scripts/plmc.sh. The resulting couplings model files (saved to the inference directory) can be directly parsed by our corresponding ev and ev+onehot predictors.

DeepSequence VAE

  1. Install the DeepSequence package.
  2. Set the DeepSequence package directory as WORKING_DIR in both src/train_vae.py and src/inference_vae.py.
  3. Use scripts/train_vae.sh for training a VAE model from an MSA.
  4. For retrieving ELBOs from VAEs, see scripts/inference_vae.sh.
  5. The saved elbo files in the inference directory can be parsed by the corresponding vae and vae+onehot predictors.

ESM

  1. Follow the instructions from the ESM repo to download the pre-trained model weights.
  2. Set the location of the downloaded pre-trained weights in scripts/inference_esm.sh.
  3. To retrieve ESM Transformer approximate pseudo-log likelihoods for sequences in a fasta file, see scripts/inference_esm.sh. The results will be in the inference directory and can be used by the esm and esm+onehot predictors.
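The scripts wrap this computation; for intuition, here is a minimal sketch of a single-pass approximation to the pseudo-log likelihood using the fair-esm Python API. The checkpoint name and scoring details are assumptions, not necessarily the repo's exact procedure:

    import torch
    import esm

    # Load a pre-trained ESM model (ESM-1b here; the repo's scripts may use
    # a different checkpoint).
    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    def approx_pseudo_log_likelihood(seq):
        # One unmasked forward pass; sum the log-probability the model
        # assigns to each observed residue at its own position.
        _, _, tokens = batch_converter([("protein", seq)])
        with torch.no_grad():
            logits = model(tokens)["logits"]
        log_probs = torch.log_softmax(logits, dim=-1)
        # Token position 0 is the beginning-of-sequence token, so residue i
        # of the protein sits at token position i + 1.
        return sum(
            log_probs[0, i + 1, tokens[0, i + 1]].item()
            for i in range(len(seq))
        )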

UniRep

  1. Download the pre-trained UniRep weights (1900-unit) from the UniRep repo.
  2. Set the location of the downloaded weights in scripts/evotune_unirep.sh.
  3. Use the scripts/evotune_unirep.sh script to evotune the UniRep model with an MSA file as seqspath.
  4. Use scripts/inference_unirep.sh to calculate log-likelihoods from an evotuned UniRep model.

Predictors

Each type of predictor is represented by a Python class in src/predictors. A predictor class represents a prediction strategy for protein fitness that depends on evolutionary data, assay-labelled data, or both. The base predictor class, BasePredictor, is defined in src/predictors/base_predictors.py. All predictor classes inherit from this class and implement the train and predict methods. The JointPredictor class is a meta-predictor that combines the features from multiple existing predictor classes and can be specified simply by the sub-predictor names. See src/predictors/__init__.py for a full list of implemented predictors.
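As an illustration of the shape of this interface, a hypothetical minimal predictor might look like the sketch below. The method signatures here are assumptions; check BasePredictor in src/predictors/base_predictors.py for the actual interface:

    import numpy as np

    from src.predictors.base_predictors import BasePredictor

    class MeanBaselinePredictor(BasePredictor):
        """Hypothetical baseline: predicts the mean training fitness for
        every sequence. Signatures are assumed, not taken from the repo."""

        def train(self, train_seqs, train_labels):
            # Fit on the assay-labelled data.
            self.mean_fitness = float(np.mean(train_labels))

        def predict(self, seqs):
            # Return one predicted fitness value per input sequence.
            return np.full(len(seqs), self.mean_fitness)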