Official repository for the paper "Learning from pre-pandemic data to forecast viral antibody escape"

EVEscape

This is the official code repository for the paper "Learning from pre-pandemic data to forecast viral antibody escape". This paper is a joint collaboration between the Marks Lab and the OATML group.

Overview

EVEscape is a model that predicts the likelihood that a given viral protein variant will escape antibody recognition. For each protein, EVEscape predicts escape using only data sources available pre-pandemic: sequence likelihood predictions from broader viral evolution, antibody accessibility information from protein structures, and changes in binding interaction propensity from residue chemical properties.

Usage

EVEscape scores are computed from three components:

  1. Fitness: use scores from EVE, an unsupervised generative model of mutation effect from broader evolutionary sequences
  2. Accessibility: calculate the weighted contact number (WCN) from PDB structures of relevant conformations of the viral protein of interest
  3. Dissimilarity: calculate difference in charge and hydrophobicity between the mutant residue and the wildtype
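
As a rough illustration of the accessibility and dissimilarity components, the sketch below computes a weighted contact number from a set of residue coordinates and the charge and hydrophobicity differences for a substitution. It is a minimal sketch, not the repository's implementation: the function names are hypothetical, the hydropathy table is the Kyte-Doolittle scale (an assumption; the repository may use a different scale), and the charge table is a simplified one that treats histidine as neutral.

```python
import numpy as np

# Kyte-Doolittle hydropathy values (assumed scale, for illustration only).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified side-chain charges at neutral pH; all other residues count as 0.
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def weighted_contact_number(coords):
    """WCN_i = sum over j != i of 1 / r_ij^2 for an (N, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dist, np.inf)  # a residue does not contact itself
    return (1.0 / sq_dist).sum(axis=1)


def dissimilarity(wt, mut):
    """Absolute change in charge and hydropathy for the substitution wt -> mut."""
    d_charge = abs(CHARGE.get(mut, 0) - CHARGE.get(wt, 0))
    d_hydro = abs(HYDROPATHY[mut] - HYDROPATHY[wt])
    return d_charge, d_hydro
```

Residues with a low WCN have few close neighbours and are therefore more exposed; how the per-residue coordinates are chosen (for example, alpha carbons versus side-chain centres) is left open here.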

Each component is then standardized and passed through a temperature-scaled logistic function. The final EVEscape score is the log of the product of the three resulting terms.
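
The combination step can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the default temperatures are placeholders rather than the values used in the paper, and each input array is assumed to be oriented so that larger values favor escape.

```python
import numpy as np


def standardize(x):
    """Rescale values to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def scaled_logistic(z, temperature):
    """Logistic function applied to z / temperature."""
    return 1.0 / (1.0 + np.exp(-z / temperature))


def evescape_like_score(fitness, accessibility, dissimilarity,
                        temperatures=(1.0, 1.0, 1.0)):
    """Log of the product of the three standardized, logistic-transformed terms.

    The temperatures are hypothetical defaults, not the published settings.
    """
    components = (fitness, accessibility, dissimilarity)
    terms = [
        scaled_logistic(standardize(c), t)
        for c, t in zip(components, temperatures)
    ]
    return np.log(terms[0] * terms[1] * terms[2])
```

Because each logistic term lies strictly between 0 and 1, the resulting scores are always negative; a higher temperature flattens the corresponding term and reduces its influence on the ranking.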

We also provide EVEscape scores for all single mutation variants of SARS-CoV-2 Spike and aggregate strain-level predictions for all PANGO lineages in our paper. EVEscape rankings of newly occurring GISAID strains and visualizations of likely future mutations will be available at evescape.org.

Example scripts

The scripts folder contains Python scripts that calculate EVEscape scores for all single mutations and aggregate deep mutational scanning data for SARS-CoV-2 RBD, Flu HA, and HIV Env from data. Specifically, this includes the following two scripts:

Data requirements

Computing EVEscape scores requires one or more PDB files, EVE scores (see "Generating EVE scores" below), and a FASTA file of the wildtype sequence for the viral protein of interest.
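
To make the role of the wildtype FASTA file concrete, the sketch below reads the first sequence from FASTA-formatted text and enumerates every single amino acid substitution. The function names and the mutation label format (wildtype residue, 1-based position, mutant residue) are assumptions for illustration and may not match the naming used by the repository's scripts.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta_sequence(text):
    """Return the first sequence found in FASTA-formatted text."""
    seq = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts the next record
            seen_header = True
            continue
        seq.append(line)
    return "".join(seq).upper()


def single_mutants(wt_seq):
    """List all single substitutions, e.g. 'M1A', using 1-based positions."""
    return [
        f"{wt}{pos}{mut}"
        for pos, wt in enumerate(wt_seq, start=1)
        for mut in AMINO_ACIDS
        if mut != wt
    ]
```

For a file on disk, the text can be obtained with `open(path).read()` before calling `read_fasta_sequence`. A sequence of length L yields 19 × L single substitutions.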

To download the RBD escape data used in this project (~120MB unzipped):

curl -o escape_dms_data_20220109.zip https://marks.hms.harvard.edu/evescape/escape_dms_data_20220109.zip
unzip escape_dms_data_20220109.zip
rm escape_dms_data_20220109.zip

(originally downloaded from SARS2_RBD_Ab_escape_maps)

Generating EVE scores

We leverage the original EVE codebase to compute the evolutionary indices used in EVEscape.

Model training

The MSAs used to train the EVE models used in this project can be found in the supplemental material of the paper (Data S1).

We modify the Bayesian VAE training script to support the following hyperparameter choices in the MSA_processing call:

  • sequence re-weighting in MSA (theta): we choose a value of 0.01 that is better suited to viruses (Hopf et al., Riesselman et al.)
  • fragment filtering (threshold_sequence_frac_gaps): we keep sequences in the MSA that align to at least 50% of the target sequence.
  • position filtering (threshold_focus_cols_frac_gaps): we keep columns with at least 70% coverage, except for SARS-CoV-2 Spike for which we lower the required value to 30% in order to maximally cover experimental positions and significant pandemic sites.
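
The three preprocessing choices above can be sketched on a toy alignment as follows. This is a simplified illustration, not the EVE `MSA_processing` code: the function names and signatures are hypothetical, coverage is measured over all alignment columns rather than only the target's positions, and the pairwise comparison uses memory quadratic in the number of sequences, so it is only suitable for small alignments.

```python
import numpy as np

GAP = "-"


def filter_msa(msa, min_seq_coverage=0.5, min_col_coverage=0.7):
    """Drop fragmentary sequences, then drop poorly covered columns.

    msa is a list of equal-length aligned strings.
    """
    arr = np.array([list(s) for s in msa])
    non_gap = arr != GAP
    # Fragment filtering: keep sequences covering enough of the alignment.
    keep_rows = non_gap.mean(axis=1) >= min_seq_coverage
    arr, non_gap = arr[keep_rows], non_gap[keep_rows]
    # Position filtering: keep columns with enough non-gap entries.
    keep_cols = non_gap.mean(axis=0) >= min_col_coverage
    return arr[:, keep_cols]


def sequence_weights(arr, theta=0.01):
    """Weight each sequence by 1 / (number of sequences within distance theta).

    Distance is the fraction of differing positions; each sequence counts
    itself as a neighbour, so weights lie in (0, 1].
    """
    dist = (arr[:, None, :] != arr[None, :, :]).mean(axis=-1)
    neighbours = (dist < theta).sum(axis=1)
    return 1.0 / neighbours
```

With a small `theta` such as 0.01, only near-identical sequences are grouped together, which preserves more of the limited diversity found in viral alignments. Lowering `min_col_coverage` to 0.3, as described for SARS-CoV-2 Spike, keeps more columns at the cost of more gaps per column.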

We train 5 independent models with different random seeds.

Model scoring

For each of the 5 independently trained models, we compute evolutionary indices by sampling 20k times from the approximate posterior distribution (i.e., num_samples_compute_evol_indices=20000). We then average the resulting scores across the 5 models to obtain the final EVE scores used in EVEscape.
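
The averaging step can be sketched as below, assuming each model's evolutionary indices have already been loaded into a dictionary keyed by mutation label. The function name and the dictionary layout are hypothetical; the EVE codebase writes its scores to files whose format is not described here.

```python
def ensemble_scores(per_model):
    """Average evolutionary indices across models.

    per_model is a list of dicts mapping a mutation label to that model's
    evolutionary index. Only mutations scored by every model are kept.
    """
    shared = set(per_model[0]).intersection(*per_model[1:])
    return {
        mutation: sum(scores[mutation] for scores in per_model) / len(per_model)
        for mutation in sorted(shared)
    }
```

Restricting the average to mutations present in every model avoids silently averaging over fewer than the intended number of models.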

License

This project is available under the MIT license.

Reference

If you use this code, please cite the following paper:

Nicole N. Thadani*, Sarah Gurev*, Pascal Notin*, Noor Youssef, Nathan J. Rollins, Chris Sander, Yarin Gal, Debora S. Marks. Learning from pre-pandemic data to forecast viral antibody escape. bioRxiv, 2022.

(* equal contribution)

Links: