This is the code base for generating protein functional site predictions using the CATH Funsite predictor.
This git repository contains the following:
- datasets
  This contains all the datasets mentioned in the Funsite manuscript, along with Jupyter notebooks showing the steps for generating them. Some dataset files are very large; these can be downloaded from ftp://orengoftp.biochem.ucl.ac.uk/cath/supplementary-materials/2020-cath-funsite-predictor/
- funsite_models
  This contains the Funsite model generation and prediction scripts, together with the benchmark results reported in the Funsite manuscript for the dataset files.
- scripts
  This contains all scripts used for generating Funsite models and predictions.
virtualenv -p python3.6 FunsiteEnv
source FunsiteEnv/bin/activate
# install dependencies
pip install -r scripts/FunsiteEnv_requirements.txt
Note that installing XGBoost on macOS may require Xcode (developer tools) and the Xcode Command Line Tools.
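After installing the dependencies, it can be useful to confirm that the key packages are visible from inside the activated environment. The following is a minimal sketch, not part of this repository: the helper name `missing_packages` is hypothetical, and the list of packages to check is an assumption (only XGBoost is named above; extend the list to match `scripts/FunsiteEnv_requirements.txt`).

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: "xgboost" is the import name of the XGBoost dependency.
    # Add further top-level package names from the requirements file as needed.
    required = ["xgboost"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Run this with the `FunsiteEnv` environment activated. If XGBoost is reported as missing on macOS, the Xcode note above is a likely place to start.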
To get CATH FunFam assignments for proteins, use the online search tool for a few sequences, or the cath-genomescan tool for large sequence datasets.
This repository uses external bioinformatics tools that are not written or maintained by the authors of this project. If you use the results of these tools, please cite the relevant papers:
Capra, J A, and M Singh. 2008. ‘Characterization and Prediction of Residues Determining Protein Functional Specificity’. Bioinformatics 24 (13): 1473-1480.
Valdar, W S J. 2002. ‘Scoring Residue Conservation’. Proteins: Structure, Function, and Genetics 48 (2): 227-241.
Kabsch, W, and C Sander. 1983. ‘DSSP: Definition of Secondary Structure of Proteins given a Set of 3D Coordinates’. Biopolymers 22: 2577–2637.
Hubbard, S J. 1992. ‘NACCESS: Program for Calculating Accessibilities’. Department of Biochemistry and Molecular Biology, University College of London.
Laskowski, R A. 1995. ‘SURFNET: A Program for Visualizing Molecular Surfaces, Cavities, and Intermolecular Interactions’. J. Mol. Graph. 13 (5): 307-308,323-330.
Schymkowitz, Joost, Jesper Borg, Francois Stricher, Robby Nys, Frederic Rousseau, and Luis Serrano. 2005. ‘The FoldX Web Server: An Online Force Field’. Nucleic Acids Res. 33 (Web Server issue): W382-8.
Skjærven, Lars, Shashank Jariwala, Xin-Qiu Yao, Julien Idé, and Barry J Grant. 2016. ‘The Bio3D Project: Interactive Tools for Structural Bioinformatics’. Biophys. J. 110 (3): 379a.
Mihel, Josip, Mile Sikić, Sanja Tomić, Branko Jeren, and Kristian Vlahovicek. 2008. ‘PSAIA - Protein Structure and Interaction Analyzer’. BMC Struct. Biol. 8 (April): 21.
Altschul, S F, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, and D J Lipman. 1997. ‘Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs’. Nucleic Acids Res. 25 (17): 3389-3402.
The most recent papers describing the CATH protein structure database and CATH FunFams: