/tape-neurips2019

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)

Primary LanguagePythonMIT LicenseMIT

Tasks Assessing Protein Embeddings (TAPE)

Note: This is the code associated with our original paper and benchmark. To view our new pytorch code, see https://github.com/songlab-cal/tape.

Data, weights, and code for running the TAPE benchmark on a trained protein embedding. We provide a pretraining corpus, five supervised downstream tasks, pretrained language model weights, and benchmarking code.

Contents

Paper

Preprint is available at https://arxiv.org/abs/1906.08230.

Data

Data should be placed in the ./data folder, although you may also specify a different data directory if you wish.

The supervised data is around 120MB compressed and 900 MB uncompressed. The unsupervised Pfam dataset is around 5GB compressed and 40GB uncompressed. The data for training is hosted on AWS. By default we provide data as TFRecords - see ./tape/data_utils/ for deserializers for each dataset and documentation of data keys. If you wish to download all of TAPE, run download_data.sh to do so. We also provide links to each individual dataset below in both TFRecord format and JSON format.

TFRecord Data

Pretraining Corpus (Pfam) | Secondary Structure | Contact (ProteinNet) | Remote Homology | Fluorescence | Stability

Raw Data

Raw data files are stored in JSON format for maximum portability. These are larger than the serialized TFRecord files (on average 3x larger). For all tasks except proteinnet we directly provide the output of our TFRecord parsing function on the file. For the proteinnet task we don't directly provide contact maps (as these massively increase the size of the files) and instead provide the 3D positions of all Carbon-alpha atoms. Note that this is in fact what is stored in the TFRecord files as well - our parsing function constructs the contact maps from this information on-the-fly.

Pretraining Corpus (Pfam) | Secondary Structure | Contact (ProteinNet) | Remote Homology | Fluorescence | Stability

Pretrained Models

We provide weights for all models pretrained as detailed in the paper. Each set of weights comes in an h5 file and is roughly 100 MB. If you wish to download all models, run download_pretrained_models.sh to do so. We also provide links to each individual model's weights below:

LSTM | Transformer | ResNet | UniRep (mLSTM) | Bepler Only Unsupervised | Bepler Unsupervised + MultiTask

UniRep is described in Alley et al. Bepler refers to the models in Bepler et al.

Code Setup

We recommend that you install tape into a python virtual environment using

$ pip install -e .

Usage

tape uses Sacred to configure and store logging information.

Sacred options are specified by running tape with <args>. For example, to run the transformer model on the masked_language_modeling task, simply run

$ tape with model=transformer tasks=masked_language_modeling

Additional arguments can be specified by adding e.g. transformer.n_layers=6, training.learning_rate=1e-4, gpu.device=0,1,2, etc.

Global arguments are defined under @tape.config in tape/__main__.py. Model specific arguments (e.g. transformer.n_layers) can be found in the corresponding model file (tape/models/Transformer.py).

Loading a Model

There are two ways of loading a model, depending on whether you want to load the unsupervised pre-training weights or the supervised task-specific weights. Loading unsupervised weights is done by passing the argument load_from=</path/to/unsupervised_weights.h5>. Loading supervised weights is done by passing the argument load_task_from=</path/to/supervised_weights.h5>.

Saving Results

Results will be stored in results/. Each run will be placed in a timestamped directory. All tape sources will automatically be saved, along with the config and per-epoch metrics.

Running the trained Task Model

Once you've trained your task model, you can run an evaluation step like this, passing your test set (as tfrecords) to --datafile:

$ tape-eval results/<task-name>-<model>-<time-stamp>/ --datafile data/remote_homology/remote_homology_test_fold_holdout.tfrecord

This will report the key accuracy metric on your dataset, as well as save the outputs of the model to results/<task-name>-<model>-<time-stamp>/outputs.pkl for more detailed analysis

List of Models and Tasks

The available models are:

  • transformer
  • resnet
  • lstm
  • bepler
  • unirep
  • one_hot
  • one_hot_evolutionary

The available standard tasks are:

  • contact_map
  • stability
  • fluorescence
  • language_modeling
  • masked_language_modeling
  • remote_homology
  • secondary_structure

Additionally, we have some Unirep and Bepler specific tasks:

  • bepler_language_modeling: for unsupervised pre-training of the Bepler model.
  • unidirectional_language_modeling: for unsupervised pre-training of the Unirep model.

Finally we also provide the netsurf task, which does the full multi-task Netsurf training described in the original paper. This is done on the same dataset as secondary structure.

The available models and tasks can be found in tape/models/ModelBuilder.py and tape/tasks/TaskBuilder.py.

Extracting Embeddings

If you would just like the embeddings for a list of proteins,

$ tape-embed <filename>.fasta --model <model> --load-from <pretrained-weights>.h5

You can also go from tfrecords, by first converting your fasta to tfrecords

$ tape-serialize <filename>.fasta

This will create a new serialized file in the same directory <filename>.tfrecord. You can then extract the embeddings with

$ tape-embed <filename>.tfrecord --model <model> --load-from <pretrained-weights>.h5

Which will create a file outputs.pkl in your current directory with the list of embeddings.

Extending Tape

If you'd like to extend tape directly, it is certainly possible. Unfortunately sacred, while great for storing and logging results, is a little bit of a pain when it comes to modularity and extensibility. That being said, it is reasonably straightforward to add a Keras model into the training framework. For examples on how to do this, see the examples/ directory, which contains two files showing how to add a model, and how to add a model with hyperparameters. If there is a more specific example that would be helpful, please open an issue and we will try to add it if we have time.

Leaderboard

We will soon have a leaderboard available for tracking progress on the core five TAPE tasks, so check back for a link here. See the main tables in our paper for a sense of where performance stands at this point. Publication on the leaderboard will be contingent on meeting the following citation guidelines.

In the meantime, here's a temporary leaderboard for each task. All reported models on this leaderboard use unsupervised pretraining.

Secondary Structure

Ranking Model Accuracy (3-class)
1. One Hot + Alignment 0.80
2. LSTM 0.75
2. ResNet 0.75
4. Transformer 0.73
4. Bepler 0.73
4. Unirep 0.73
7. One Hot 0.69

Contact Prediction

Ranking Model L/5 Medium + Long Range
1. One Hot + Alignment 0.64
2. Bepler 0.40
3. LSTM 0.39
4. Transformer 0.36
5. Unirep 0.34
6. ResNet 0.29
6. One Hot 0.29

Remote Homology Detection

Ranking Model Top 1 Accuracy
1. LSTM 0.26
2. Unirep 0.23
3. Transformer 0.21
4. Bepler 0.17
4. ResNet 0.17
6. One Hot + Alignment 0.09
6. One Hot 0.09

Fluorescence

Ranking Model Spearman's rho
1. Transformer 0.68
2. LSTM 0.67
2. Unirep 0.67
4. Bepler 0.33
5. ResNet 0.21
6. One Hot 0.14

Stability

Ranking Model Spearman's rho
1. Transformer 0.73
1. Unirep 0.73
1. ResNet 0.73
4. LSTM 0.69
5. Bepler 0.64
6. One Hot 0.19

Citation Guidelines

If you find TAPE useful, please cite our corresponding paper. Additionally, anyone using the datasets provided in TAPE must describe and cite all dataset components they use. Producing these data is time and resource intensive, and we insist this be recognized by all TAPE users. For convenience,data_refs.bib contains all necessary citations. We also provide each individual citation below.

TAPE (Our paper):

@article{
author = {Rao, Roshan and Bhattacharya, Nicholas and Thomas, Neil and Duan, Yan and Chen, Xi and Canny, John and Abbeel, Pieter and Song, Yun S},
title = {Evaluating Protein Transfer Learning with TAPE},
year = {2019}
}

Pfam (Pretraining):

@article{
author = {El-Gebali, Sara and Mistry, Jaina and Bateman, Alex and Eddy, Sean R and Luciani, Aur{\'{e}}lien and Potter, Simon C and Qureshi, Matloob and Richardson, Lorna J and Salazar, Gustavo A and Smart, Alfredo and Sonnhammer, Erik L L and Hirsh, Layla and Paladin, Lisanna and Piovesan, Damiano and Tosatto, Silvio C E and Finn, Robert D},
doi = {10.1093/nar/gky995},
file = {::},
issn = {0305-1048},
journal = {Nucleic Acids Research},
keywords = {community,protein domains,tandem repeat sequences},
number = {D1},
pages = {D427--D432},
publisher = {Narnia},
title = {{The Pfam protein families database in 2019}},
url = {https://academic.oup.com/nar/article/47/D1/D427/5144153},
volume = {47},
year = {2019}
}

SCOPe: (Remote Homology and Contact)-

@article{
  title={SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures},
  author={Fox, Naomi K and Brenner, Steven E and Chandonia, John-Marc},
  journal={Nucleic acids research},
  volume={42},
  number={D1},
  pages={D304--D309},
  year={2013},
  publisher={Oxford University Press}
}

PDB: (Secondary Structure and Contact)

@article{
  title={The protein data bank},
  author={Berman, Helen M and Westbrook, John and Feng, Zukang and Gilliland, Gary and Bhat, Talapady N and Weissig, Helge and Shindyalov, Ilya N and Bourne, Philip E},
  journal={Nucleic acids research},
  volume={28},
  number={1},
  pages={235--242},
  year={2000},
  publisher={Oxford University Press}
}

CASP12: (Secondary Structure and Contact)

@article{
author = {Moult, John and Fidelis, Krzysztof and Kryshtafovych, Andriy and Schwede, Torsten and Tramontano, Anna},
doi = {10.1002/prot.25415},
issn = {08873585},
journal = {Proteins: Structure, Function, and Bioinformatics},
keywords = {CASP,community wide experiment,protein structure prediction},
pages = {7--15},
publisher = {John Wiley {\&} Sons, Ltd},
title = {{Critical assessment of methods of protein structure prediction (CASP)-Round XII}},
url = {http://doi.wiley.com/10.1002/prot.25415},
volume = {86},
year = {2018}
}

NetSurfP2.0: (Secondary Structure)

@article{netsurfp,
  title={NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning},
  author={Klausen, Michael Schantz and Jespersen, Martin Closter and Nielsen, Henrik and Jensen, Kamilla Kjaergaard and Jurtz, Vanessa Isabell and Soenderby, Casper Kaae and Sommer, Morten Otto Alexander and Winther, Ole and Nielsen, Morten and Petersen, Bent and others},
  journal={Proteins: Structure, Function, and Bioinformatics},
  year={2019},
  publisher={Wiley Online Library}
}

ProteinNet: (Contact)

@article{
  title={ProteinNet: a standardized data set for machine learning of protein structure},
  author={AlQuraishi, Mohammed},
  journal={arXiv preprint arXiv:1902.00249},
  year={2019}
}

Fluorescence:

@article{
  title={Local fitness landscape of the green fluorescent protein},
  author={Sarkisyan, Karen S and Bolotin, Dmitry A and Meer, Margarita V and Usmanova, Dinara R and Mishin, Alexander S and Sharonov, George V and Ivankov, Dmitry N and Bozhanova, Nina G and Baranov, Mikhail S and Soylemez, Onuralp and others},
  journal={Nature},
  volume={533},
  number={7603},
  pages={397},
  year={2016},
  publisher={Nature Publishing Group}
}

Stability:

@article{
  title={Global analysis of protein folding using massively parallel design, synthesis, and testing},
  author={Rocklin, Gabriel J and Chidyausiku, Tamuka M and Goreshnik, Inna and Ford, Alex and Houliston, Scott and Lemak, Alexander and Carter, Lauren and Ravichandran, Rashmi and Mulligan, Vikram K and Chevalier, Aaron and others},
  journal={Science},
  volume={357},
  number={6347},
  pages={168--175},
  year={2017},
  publisher={American Association for the Advancement of Science}
}