nn4dms

Neural networks for deep mutational scanning data

This repository is a supplement to our paper:
Neural networks to learn protein sequence-function relationships from deep mutational scanning data. Sam Gelman, Sarah A Fahlberg, Pete Heinzelman, Philip A Romero+, Anthony Gitter+. Proceedings of the National Academy of Sciences, 118:48, 2021.
+ denotes equal contribution.

We trained and evaluated the performance of multiple types of neural networks on five deep mutational scanning datasets. This repository contains code and examples that allow you to do the following:

  • Retrain the models from our publication
  • Train new models using our datasets or your own datasets
  • Use trained models to make predictions for new variants

Setup

This code is based on Python 3.6 and TensorFlow 1.14. Use the provided environment.yml file to set up a suitable environment with Anaconda.

conda env create -f environment.yml
conda activate nn4dms

Installation typically takes approximately 5 minutes. Note that these software versions differ slightly from the ones we used to train the models in our publication.
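
To confirm the environment is working, a quick sanity check (not part of the repository's own scripts) is to print the TensorFlow version from inside the activated environment:

python -c "import tensorflow as tf; print(tf.__version__)"

This should print 1.14.x.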

GPU support (optional)

By default, the environment uses CPU-only TensorFlow. If you have an NVIDIA GPU and want GPU support, use the environment_gpu.yml file instead. It will install tensorflow-gpu instead of tensorflow. You will need to make sure you have the appropriate CUDA drivers and software installed for TensorFlow 1.14: cudnn 7.4 and cudatoolkit 10.0. Certain versions of this NVIDIA software may also be available for your operating system via Anaconda. The GPU environment is not compatible with NVIDIA Ampere or newer microarchitectures.
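
Once the GPU environment is active, you can verify that TensorFlow detects your GPU; this check uses the standard TensorFlow 1.x API:

python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"

If this prints False, double-check your CUDA driver, cudnn, and cudatoolkit versions.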

Enrich2 (optional)

We used Enrich2 to compute functional scores for the GB1 and Bgl3 datasets. If you want to re-run that part of our pipeline, you must install Enrich2 according to the instructions on the Enrich2 GitHub page. Make sure the conda environment for Enrich2 is named "enrich2". This is optional; we provide pre-computed datasets in the data directory.

Training a model

You can train a model by calling code/regression.py with the required arguments specifying the dataset, network architecture, train-test split, etc. For convenience, regression.py accepts an arguments text file in addition to command line arguments. We provide a sample arguments file you can use as a template.

Call the following from the root directory to train a sample linear regression model on the avGFP dataset:

python code/regression.py @regression_args/example.txt 

The output, which includes the trained model, evaluation metrics, and predictions on each of the train/tune/test sets, will automatically be placed in the training_logs directory. The linear regression example above trains in less than 5 minutes. Training time will be longer for larger datasets and more complex models.

For a full list of parameters, call python code/regression.py -h.
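
The @-file syntax comes from Python's argparse fromfile_prefix_chars feature: each line of the arguments file is treated as a single command-line token, so a flag and its value go on separate lines (or use --flag=value). Below is a minimal sketch of that mechanism only; the flag names are hypothetical and are not regression.py's actual arguments (use -h for those):

import argparse

# "@file.txt" on the command line expands to the file's contents,
# one token per line, when fromfile_prefix_chars is set
parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("--dataset_name")      # hypothetical flag, for illustration
parser.add_argument("--epochs", type=int)  # hypothetical flag, for illustration
args = parser.parse_args()
print(args)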

Additional customization:

  • To define your own custom network architecture, see the readme in the network_specs directory.
  • For more control over the train-test split, see the train_test_split.ipynb notebook; a generic sketch of such a split follows this list.
  • To compute your own protein structure graph for the graph convolutional network, see the structure_graph.ipynb notebook.
  • To use your own dataset, see the readme in the data directory.
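
For intuition about what a train/tune/test split looks like, here is a minimal, generic NumPy sketch. It is not the repository's split code (use the notebook above for that); it only illustrates partitioning variant indices:

import numpy as np

def random_split(num_variants, tune_frac=0.1, test_frac=0.1, seed=7):
    """Randomly partition variant indices into train, tune, and test sets."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(num_variants)
    n_tune = int(num_variants * tune_frac)
    n_test = int(num_variants * test_frac)
    return {"tune": idx[:n_tune],
            "test": idx[n_tune:n_tune + n_test],
            "train": idx[n_tune + n_test:]}

split = random_split(50000)  # e.g., a dataset with 50,000 variants
print({name: len(indices) for name, indices in split.items()})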

Retraining models from our publication

In the pub directory, we provide various files to facilitate retraining the models from our publication:

  • The exact train/tune/test set splits
  • Pre-made arguments files that can be fed directly into regression.py

To retrain one of our models, call python code/regression.py @pub/regression_args/<desired model> from the root directory.

We also provide pre-trained models, similar to the ones from our publication, that can be used to predict scores for new variants.

Evaluating a model and making new predictions

During training, regression.py saves a variety of useful information to the log directory, including predictions for all train, tune, and test set variants and the trained model itself.

Evaluating a model

We provide convenience functions that allow you to easily load and process this log information. See the example in the analysis.ipynb notebook. You can also use TensorBoard to visualize how model performance changes during the training process. For more information, see the readme in the training_logs directory.
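
As a simple example of evaluating saved predictions, the sketch below computes Pearson and Spearman correlations between predicted and true scores. The arrays here are placeholders; in practice, load the scores from the log directory using the convenience functions shown in analysis.ipynb:

import numpy as np
from scipy import stats

# placeholder arrays; replace with scores loaded from the log directory
true_scores = np.array([0.1, 0.5, 0.9, 1.2])
pred_scores = np.array([0.2, 0.4, 1.0, 1.1])

pearson_r, _ = stats.pearsonr(true_scores, pred_scores)
spearman_rho, _ = stats.spearmanr(true_scores, pred_scores)
print("Pearson's r: {:.3f}, Spearman's rho: {:.3f}".format(pearson_r, spearman_rho))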

Using a trained model to make predictions

For a straightforward example of how to use a trained model to make predictions, see the inference.ipynb notebook.
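
For reference, the general pattern for inference with a TensorFlow 1.x model is to restore the saved graph and run a session. The sketch below shows that pattern only; the checkpoint path and tensor names are hypothetical, and the notebook shows the actual ones for these models:

import tensorflow as tf

encoded_variants = ...  # placeholder: your numerically encoded variant sequences

with tf.Session() as sess:
    # restore the graph structure and trained weights from a checkpoint
    saver = tf.train.import_meta_graph("model.ckpt.meta")  # hypothetical path
    saver.restore(sess, "model.ckpt")
    graph = tf.get_default_graph()
    # tensor names below are hypothetical; inspect the graph or see inference.ipynb
    inputs = graph.get_tensor_by_name("input_seqs:0")
    outputs = graph.get_tensor_by_name("predictions:0")
    predictions = sess.run(outputs, feed_dict={inputs: encoded_variants})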

External sources

Our implementation of graph convolutional networks is based on the implementation used in Protein Interface Prediction using Graph Convolutional Networks. The original third-party code is available under the MIT License, Copyright © 2020 Alex Fout.