This repository contains the analysis code, links to the data, and pretrained models for the paper "Learning mutational semantics" by Brian Hie, Ellen Zhong, Bryan Bryson, and Bonnie Berger, which appeared as a poster at NeurIPS 2020.
For a more biologically oriented follow-up that includes analysis of SARS-CoV-2 viral sequences, see our paper "Learning the language of viral evolution and escape".
You can download the relevant datasets (including training and validation data) using the commands
wget http://cb.csail.mit.edu/cb/viral-mutation/data.tar.gz
tar xvf data.tar.gz
within the same directory as this repository.
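On systems without wget or tar, the same two steps can be sketched in Python using only the standard library. This is a minimal sketch, not part of this repository: the helper names `download` and `extract` are hypothetical, and only the URL is taken from the command above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
DATA_URL = "http://cb.csail.mit.edu/cb/viral-mutation/data.tar.gz"


def download(url, dest):
    """Fetch url into dest (hypothetical helper); skip if dest already exists."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive, target_dir):
    """Unpack a tar archive into target_dir and return the member names.

    Mode "r:*" lets tarfile detect the compression, as `tar xvf` does.
    """
    with tarfile.open(str(archive), "r:*") as tar:
        names = tar.getnames()
        tar.extractall(str(target_dir))
    return names
```

Calling `extract(download(DATA_URL, "data.tar.gz"), ".")` from the directory where the shell commands would be run should leave the same files in place. Because `download` skips the fetch when the archive already exists, a partially downloaded file should be deleted before retrying.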
The major Python package requirements and their tested versions are in requirements.txt.
Our experiments were run with Python version 3.7 on Ubuntu 18.04.
To run the experiments below, first download the data using the instructions above. Our experiments require at most 400 GB of CPU RAM and 32 GB of GPU RAM, though often much less; in silico escape model inference can take around 35 minutes for influenza HA and 90 minutes for HIV Env.
Headline part-of-speech changes and WordNet changes can be evaluated with the command
python bin/parse_headline_mods.py results/headlines/semantics_1024.log.gz \
> headline_pos.log 2>&1
Generating headline changes can be done with the command
python bin/headlines.py bilstm --checkpoint data/headlines.hdf5 --semantics \
> semantics.log 2>&1 &
Influenza HA semantic embedding UMAPs and log files with statistics can be generated with the command
python bin/flu.py bilstm --checkpoint models/flu.hdf5 --embed \
> flu_embed.log 2>&1
Single-residue escape prediction using validation data from Doud et al. (2018) and Lee et al. (2019) can be done with the command
python bin/flu.py bilstm --checkpoint models/flu.hdf5 --semantics \
> flu_semantics.log 2>&1
Training a new model on flu HA sequences can be done with the command
python bin/flu.py bilstm --train --test \
> flu_train.log 2>&1
HIV Env semantic embedding UMAPs and log files with statistics can be generated with the command
python bin/hiv.py bilstm --checkpoint models/hiv.hdf5 --embed \
> hiv_embed.log 2>&1
Single-residue escape prediction using validation data from Dingens et al. (2019) can be done with the command
python bin/hiv.py bilstm --checkpoint models/hiv.hdf5 --semantics \
> hiv_semantics.log 2>&1
Training a new model on HIV Env sequences can be done with the command
python bin/hiv.py bilstm --train --test \
> hiv_train.log 2>&1
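The flu and HIV inference commands above can also be chained from a small driver script, which may be convenient for unattended runs. The following is a minimal sketch under stated assumptions: `JOBS` and `run_job` are hypothetical names that are not part of this repository, the script paths, flags, and log names are copied from the commands above, and the training commands are deliberately left out.

```python
import subprocess
import sys

# (script, arguments, log file), copied from the inference commands above.
JOBS = [
    ("bin/flu.py", ["bilstm", "--checkpoint", "models/flu.hdf5", "--embed"],
     "flu_embed.log"),
    ("bin/flu.py", ["bilstm", "--checkpoint", "models/flu.hdf5", "--semantics"],
     "flu_semantics.log"),
    ("bin/hiv.py", ["bilstm", "--checkpoint", "models/hiv.hdf5", "--embed"],
     "hiv_embed.log"),
    ("bin/hiv.py", ["bilstm", "--checkpoint", "models/hiv.hdf5", "--semantics"],
     "hiv_semantics.log"),
]


def run_job(script, args, log_path, python=sys.executable, dry_run=False):
    """Run one command with stdout and stderr sent to log_path.

    Equivalent to `python script args > log_path 2>&1`. With dry_run=True
    the command list is returned without being executed.
    """
    cmd = [python, script, *args]
    if dry_run:
        return cmd
    with open(log_path, "w") as log:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
    return cmd
```

Looping with `for script, args, log in JOBS: run_job(script, args, log)` from the repository root would run the four jobs one after another and stop at the first failure. Running them sequentially rather than in parallel is a deliberate choice in this sketch, since the jobs may compete for GPU memory.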
For questions about the pipeline and code, contact brianhie@mit.edu. We will do our best to provide support, address any issues, and keep improving this software. Please do not hesitate to submit a pull request and contribute!