The code is based on Dr. Tsubaki's repository (https://github.com/masashitsubaki/CPI_prediction), which implements the paper "Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences" (Bioinformatics, 2018). It uses a graph neural network to learn compound representations and a convolutional neural network to learn protein representations.
The human and C. elegans CPI datasets used in this repository were created by Liu et al. in the paper "Improving compound–protein interaction prediction by building up highly credible negative samples" (Bioinformatics, 2015).
In the CPI prediction problem setting, the input is a pair consisting of a compound in SMILES format and the amino acid sequence of a protein; the output is a binary label (interaction or no interaction). The SMILES string is converted with RDKit into 2D graph-structured data for the compound (i.e., atom types and their adjacency matrix). The protein and compound representations learned by the two sub-networks are concatenated to predict the interaction.
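As a rough illustration of the SMILES-to-graph step described above, the following sketch uses RDKit to extract atom types and the adjacency matrix. The function name `smiles_to_graph` is hypothetical, and the sketch keeps heavy atoms only; the preprocessing in this repository may differ (for example, by adding explicit hydrogens or encoding atom environments).

```python
import numpy as np
from rdkit import Chem


def smiles_to_graph(smiles):
    """Return (atom symbols, adjacency matrix) for a SMILES string.

    Hypothetical helper for illustration; not the repository's own code.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when the string cannot be parsed.
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # One entry per atom, in RDKit's atom index order.
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    # Symmetric 0/1 matrix: entry (i, j) is 1 if atoms i and j are bonded.
    adjacency = np.asarray(Chem.GetAdjacencyMatrix(mol))
    return atoms, adjacency


# Ethanol: two carbons and one oxygen (implicit hydrogens are not listed).
atoms, adjacency = smiles_to_graph("CCO")
print(atoms)
print(adjacency)
```

The atom list and adjacency matrix together form the 2D graph that the compound sub-network consumes.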
In this repository, I implement a Protein Transformer for learning protein representations. It takes advantage of self-attention and can capture long-distance interactions between amino acid residues. This brings a significant improvement in predictive performance.
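To show why self-attention can relate distant residues, here is a minimal single-head scaled dot-product attention sketch in NumPy. It is a conceptual illustration only, not the Protein Transformer implemented in this repository; the function name `self_attention`, the random embeddings, and the dimensions are all assumptions made for the example.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, dim) residue embeddings.
    w_q, w_k, w_v: (dim, dim) projection matrices.
    Returns the attended output and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every residue scores every other residue, regardless of distance.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy input: 6 residues with 4-dimensional embeddings (random placeholders).
rng = np.random.RandomState(0)
seq_len, dim = 6, 4
x = rng.randn(seq_len, dim)
w_q, w_k, w_v = (rng.randn(dim, dim) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Because the weight matrix has one entry for every pair of positions, the first and last residues can influence each other in a single layer, whereas a convolution only mixes residues that fall within its kernel window.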
- PyTorch
- scikit-learn
- RDKit
- numpy=1.16.1