This is the GitHub repository for the Copenhagen BioHackathon 2021
Tanya (GitHub: latticetower)
Sven (GitHub: sklumpe)
Thomas (GitHub: ngthomas)
Michael (GitHub: MSBradshaw)
Here, we look at two data sets of single of multi mutant sequences, their wildtype secondary structure and biophysical descriptors to predict stability scores of both single and multiple mutations within the protein sequence.
We've decided that we want to focus on features and want to start with a strong baseline. We've chosen catboost. It gives good results out of the box, it is interpretable, feature scaling doesn't affect its performance - all because it uses gradient boosting on decision trees. It also works ok with categorical features. After that we were trying to deal with the data 🙂 and improve our models.
Our baseline solution uses N-fold cross-validation, for each fold we train the model and use the trained models as an ensemble. We average their predictions to get the final prediction.
The performance of our current best solution for single mutants test file is shown on the plot:
Reproducible code can be found at ipynb notebook.
The performance of our current best solution for multiple mutants test file is shown on the plot:
Reproducible code can be found at ipynb notebook.
At the hackathon we've managed to dig into original paper's supplementary and found pdb structures for all non-mutated proteins. Many of them appear to have 0-3 aminoacids from N-terminus missing there. We've computed contact maps and designed baseline with it, however didn't completely finished the code and didn't run it.
Unfinished code can be found at ipynb notebook; the ugly process of data checking and contact maps extraction can be found at another notebook - this one run on your own risk.
Please also see the our team's BioLib page for further information
The code belongs to us, the data belongs to Baker lab and their collaborators (currently I'm not sure about its licensing).