
Hackathon-2021

This repository stores the code and results from BioHackathon-2021, aiming to predict the stability of mutated proteins (with single or multiple mutations).

The description of this Hackathon can be found here.

The data come from this article (Rocklin et al., 2017): https://www.bakerlab.org/wp-content/uploads/2017/12/Science_Rocklin_etal_2017.pdf

This is our team page. Team members: Jiajun He, Zelin Li.

An outline of our work and results can be found here.

A more detailed description is shown below.


1 Description of the task

  • The main task is to use the amino acid sequences of mini-proteins (43 a.a. in length) and their secondary structure information to predict their structural stability.
  • The inputs are the a.a. sequence (20 kinds of standard a.a. plus non-standard ones, 21 kinds in total) and the secondary structure sequence (E, T, H; 3 kinds).
  • This is a regression task: the model outputs the stability score change of the mutated mini-protein (a score proportional to ΔG). A hypothetical data-format sketch follows this list.
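To make the format concrete, here is a hypothetical example of one data point; the sequence, secondary-structure string, and score are made up for illustration and are not taken from the dataset:

    # One hypothetical training example (values are illustrative only)
    example = {
        "aa_sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDG",  # 43 residues
        "ss_sequence": "EEEETTTHHHHHHHHHTTEEEEETTEEEEETTHHHHHHHHHTT",  # E/T/H per residue
        "stability_score": 1.23,                                      # regression target (∝ ΔG)
    }
    assert len(example["aa_sequence"]) == len(example["ss_sequence"]) == 43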

2 An outline of our work

We built two kinds of models: simple machine learning methods, and deep learning models with a transformer and LSTMs.

2.1 Simple Machine Learning Methods

We use one-hot encoding for amino acids and for secondary structures, and MLP, random forest (RF), and SVM models to perform the regression.

Here is the notebook for these models.
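A minimal sketch of this pipeline, assuming scikit-learn and a flattened per-position one-hot encoding; the hyperparameters and placeholder variable names are illustrative, not the exact settings in the notebook:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neural_network import MLPRegressor
    from sklearn.svm import SVR

    AA = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]  # 20 standard a.a. + one non-standard placeholder
    SS = list("ETH")                           # secondary structure labels

    def one_hot(seq, vocab):
        """Flatten a per-position one-hot encoding of a sequence into a single vector."""
        idx = {c: i for i, c in enumerate(vocab)}
        mat = np.zeros((len(seq), len(vocab)))
        mat[np.arange(len(seq)), [idx[c] for c in seq]] = 1.0
        return mat.ravel()

    def encode(aa_seq, ss_seq):
        return np.concatenate([one_hot(aa_seq, AA), one_hot(ss_seq, SS)])

    # X_train / y_train are placeholders for the encoded hackathon data and stability scores:
    # X_train = np.stack([encode(a, s) for a, s in zip(train_aa, train_ss)])
    models = {
        "MLP": MLPRegressor(hidden_layer_sizes=(256, 64), max_iter=500),
        "RF": RandomForestRegressor(n_estimators=500),
        "SVM": SVR(kernel="rbf", C=10.0),
    }
    # for name, m in models.items():
    #     m.fit(X_train, y_train)
    #     print(name, np.corrcoef(m.predict(X_test), y_test)[0, 1])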

2.2 Deep Learning Models with transformer and LSTMs

We first obtained a latent embedding for each amino acid with a transformer (the pretrained ESM-1b model). We then built an RNN with LSTMs on top of these embeddings to make the prediction.

The overall structure is:

[Figure: overall model architecture]
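The following is a minimal sketch of this pipeline, assuming the fair-esm package for the pretrained ESM-1b model; the example sequence, hidden size, and regression head are illustrative placeholders, not the exact configuration used in our notebooks:

    import torch
    import esm

    # Load the pretrained ESM-1b model (requires `pip install fair-esm`)
    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    # Hypothetical 43-residue mini-protein
    data = [("mini_protein_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDG")]
    _, _, tokens = batch_converter(data)

    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    # Per-residue embeddings from the last layer, dropping the BOS/EOS tokens
    embeddings = out["representations"][33][:, 1:-1, :]   # (batch, 43, 1280)

    # A small bidirectional LSTM regression head on top of the embeddings
    class StabilityLSTM(torch.nn.Module):
        def __init__(self, emb_dim=1280, hidden=128):
            super().__init__()
            self.lstm = torch.nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.head = torch.nn.Linear(2 * hidden, 1)

        def forward(self, x):
            h, _ = self.lstm(x)                     # (batch, seq_len, 2*hidden)
            return self.head(h[:, -1]).squeeze(-1)  # one stability score per sequence

    score = StabilityLSTM()(embeddings)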

We also used the transformer to predict a contact map and combined it with the sequence features using an attention mechanism. However, we found no significant improvement (the correlation was only 0.001 higher than Model 2; more details can be found in the "NoteBook for testing"), so to keep the model simple we did not use the contact map in our final results. A hedged sketch of this variant is shown below.
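A sketch of extracting the contact map with ESM-1b (fair-esm); the attention-based fusion shown here is a simple hypothetical variant, not necessarily the exact mechanism used in Model 3:

    import torch
    import esm

    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    _, _, tokens = batch_converter([("mini_protein_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDG")])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=True)

    contacts = out["contacts"]                         # (batch, L, L) predicted contact probabilities
    embeddings = out["representations"][33][:, 1:-1, :]  # (batch, L, 1280) per-residue embeddings

    # One simple (hypothetical) fusion: use the contact map as attention weights
    # over the per-residue embeddings before the LSTM head.
    attn = torch.softmax(contacts, dim=-1)             # row-normalise contact probabilities
    context = torch.bmm(attn, embeddings)              # (batch, L, 1280) contact-weighted features
    fused = torch.cat([embeddings, context], dim=-1)   # input to the LSTM head of Model 3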

Here are the notebooks for these models:

Model 1: Transformer+LSTM using SS (Our Final Model for Testing)

Model 2: Transformer+LSTM without SS

Model 3: Transformer+LSTM+Contact Map

NoteBook for testing

* Some weights were retrained, so there are slight differences from the results below, but the overall trends and conclusions are the same.

3 Results and Plots

3.1 Correlation Coefficient and Plots for each Model

Model                           Correlation (Single Mutation)   Correlation (Multiple Mutations)
MLP                             0.8451                          0.3177
RF                              0.8136                          0.3827
SVM                             0.8350                          0.4089
Transformer Embedding + LSTMs   0.8912                          0.5940
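Assuming the reported values are Pearson correlation coefficients between predicted and measured stability scores (the exact metric choice is an assumption), they can be computed as in this short sketch with placeholder arrays:

    import numpy as np
    from scipy.stats import pearsonr

    # y_true: measured stability scores, y_pred: model predictions (placeholders)
    y_true = np.array([1.2, -0.3, 0.8, 0.1])
    y_pred = np.array([1.0, -0.1, 0.9, 0.3])

    r, _ = pearsonr(y_true, y_pred)   # correlation coefficient as reported in the table
    print(f"correlation coefficient: {r:.4f}")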

Plots on Test data:

Single Mutation: [Figure: predictions vs. test data, single mutations]
Multiple Mutations: [Figure: predictions vs. test data, multiple mutations]

3.2 The necessity of Secondary Structure

We also explored the necessity of secondary structure (SS) and found that it is essentially unnecessary for our task.

We built 2 models, one with SS and one without SS; the results are as follows:

[Figure: results of the models with and without secondary structure]

4 Conclusions and Discussions

  • Better feature engineering yields better results: transformer embeddings outperform simple one-hot encoding in our task.

  • Multiple mutation data are harder to predict than single mutation data, especially for proteins with a negative score.

  • Secondary structure is almost redundant for our task.

    • Possible Reasons:
      • We have only 4 original sequences. So our task can be seen as 4 individual regressions, and the secondary structure only serves as a category label.
      • If the original energies are thought to be similar, then all the information is stored in the mutated amino acid sequence.

5 Future Plans

  • Fine-tuning on each dataset separately.

  • Better feature engineering, e.g., considering the chemical properties of amino acids.

  • Better architectures, e.g., transfer learning with transformers.

  • Collect mutation data from more proteins (preferably from different organisms and environments).

Bibliography

Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).

Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv (2019). doi:10.1101/622803

Rao, R. M., Meier, J., Sercu, T., Ovchinnikov, S. & Rives, A. Transformer protein language models are unsupervised structure learners. bioRxiv (2020). doi:10.1101/2020.12.15.422761