python3 gen_fingerprint.py 2xni.pdb
Methods to obtain fingerprint for a protein-ligand complex.
This project demonstrates an implementation of protein-ligand fingerprinting, a computational method used to analyze the interactions between proteins and small molecules (ligands). It generates a set of features, or "fingerprints", that characterize the chemical and physical properties of both the protein and the ligand. These fingerprints are often used as input to ML algorithms and to measure similarity between complexes. This project was built as part of the course CS61060 Computational Biophysics: Algorithms to Applications at the Indian Institute of Technology, Kharagpur. It implements the following basic fingerprinting methods:
- Neighbourhood based fingerprinting
  - obtains counts of N, CA, C, O and R atoms of the protein in the neighbourhood of each ligand atom
- Encoded Neighbourhood based fingerprinting
  - encodes the Neighbourhood based fingerprint using a transformer model to obtain a fixed length fingerprint
- Kmer based fingerprinting
  - obtains fingerprints based on the presence/absence of k-mers
- MACCS key
  - pre-existing fixed length ligand fingerprinting method
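The first and third methods in the list above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: the function names, the tuple-based atom representation, and the `radius` cutoff are assumptions made for illustration, and the project itself stores fingerprints as NumPy vectors.

```python
from itertools import product
from math import dist

# Atom groups counted around each ligand atom, as listed above.
GROUPS = ("N", "CA", "C", "O", "R")


def neighbour_counts(ligand_xyz, protein_atoms, radius=5.0):
    """Return one [N, CA, C, O, R] count row per ligand atom.

    protein_atoms is a list of (group, (x, y, z)) pairs. The radius
    (in angstroms) is an assumed cutoff, not a value from the project.
    """
    rows = []
    for lig in ligand_xyz:
        row = [0] * len(GROUPS)
        for group, xyz in protein_atoms:
            if dist(lig, xyz) <= radius:
                row[GROUPS.index(group)] += 1
        rows.append(row)
    return rows


def kmer_presence(sequence, alphabet, k):
    """Return a 0/1 vector over all k-mers of the alphabet, in sorted order."""
    seen = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    basis = ("".join(p) for p in product(sorted(alphabet), repeat=k))
    return [1 if kmer in seen else 0 for kmer in basis]


# Toy data: two ligand atoms and four protein atoms on the x axis.
ligand = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
protein = [("N", (1.0, 0.0, 0.0)), ("CA", (2.0, 0.0, 0.0)),
           ("R", (9.0, 0.0, 0.0)), ("O", (30.0, 0.0, 0.0))]
counts = neighbour_counts(ligand, protein, radius=3.0)

# Toy sequence over a hypothetical two-letter recoded alphabet.
presence = kmer_presence("ABBA", "AB", 2)
```

The neighbourhood fingerprint grows with the number of ligand atoms (one row of five counts per atom), which is why a separate encoding step is needed to obtain a fixed length vector.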
The major frameworks and libraries used in this project are listed below, along with what each one is used for.
- Python
- Numpy - The fingerprints are mainly stored as numpy vectors
- PyTorch - Mainly required for the Transformer model
- scikit-learn - Used for implementing RandomForestRegressor to predict binding affinity of protein-ligand complexes using our fingerprinting method
- DeepChem - Used to download pdbbind data for binding affinity of complexes
- Biopython - Used to parse PDB files
- RDKit - Used to generate SMILES for ligands
- Pandas - Used to store the data in a dataframe
- Matplotlib - Used to plot the graphs
- Seaborn - Used to plot the graphs
- tqdm - Used to display progress bars
- SciPy - Used to calculate the distance between atoms
Following are the details of the file structure of this project:
.
├── binding_affinity_prediction.py
├── data
│ ├── PDB
│ ├── pdbbind_core_df.csv.gz
│ ├── pdb_files
│ └── SMILES
├── fingerprint
│ ├── alphabets.py
│ ├── base.py
│ ├── __init__.py
│ ├── interactions.py
│ ├── kmer.py
│ ├── ligand.py
│ ├── neighbour.py
│ ├── parser.py
│ ├── transformer.py
│ └── utils.py
├── gen_fingerprint.py
├── images
│ └── protein.jpeg
├── LICENSE
├── models
│ └── AutoencoderTransformer_4.pt
├── output
├── README.md
├── requirements.txt
├── similarity.py
├── train.py
The files in this code base and their functionality are described below.
- fingerprint/parser.py - This file contains class implementations to represent a protein-ligand complex as an object after parsing a PDB file
  - Atom - Class to store information for a single atom, such as its name, the residue it is part of, its coordinates, etc.
  - Protein - Class to store a protein as a sequence of atoms in chains
  - Ligand - Class to store a ligand as a sequence of atoms
  - ProteinLigandSideChainComplex - Class to store a protein-ligand complex as a combination of a Protein object and a Ligand object, where the Protein object doesn't store all atoms of a side chain but rather stores the side chain as a single atom group
  - ProteinLigandComplex - Class to store a protein-ligand complex as a combination of a Protein object and a Ligand object
- fingerprint/base.py - This file contains the class implementation of BaseFingerprint, which serves as a base class for the original NeighbourFingerprint class
- fingerprint/neighbour.py - This file contains the class implementation for our Neighbourhood based fingerprinting scheme
  - NeighbourFingerprint - derived from BaseFingerprint, this class obtains a fingerprint of length N*5, where N is the number of ligand atoms in the complex and the dimension 5 comes from the counts of N, CA, C, O and R. Each entry denotes the count of that atom/group within a certain radius of the ligand atom
- fingerprint/alphabets.py - This file contains various amino acid recoding (AAR) schemes used in Kmer based fingerprinting
- fingerprint/kmer.py - This file contains various class implementations for the Kmer based fingerprinting scheme
  - KmerBasis - Class to store a kmer basis set and transform new vectors into the fitted basis
  - KmerSet - Given an alphabet and k, creates an iterator for the kmer basis set
  - KmerVec - generates kmer vectors by searching for all kmers of the set in the protein
- fingerprint/transformer.py - This file contains the transformer implementation used to encode a Neighbourhood based fingerprint into a fixed length vector
  - AutoencoderTransformer - this model encodes a variable length fingerprint into a fixed length vector using a Transformer based architecture
- fingerprint/utils.py - This file contains various utility functions that help in the fingerprinting process
- data/PDB - folder in which PDB files are downloaded and stored
- data/SMILES - folder in which SMILES files are downloaded and stored
- train.py - This file contains the code to train the AutoencoderTransformer model. It first generates the original Neighbourhood based fingerprint for a set of ~190 protein-ligand complexes from PDBBind. It then feeds these fingerprints to the encoder of the Transformer. The task of the Transformer decoder is to reconstruct the encoder input as closely as possible. The fixed length encoding is obtained by taking the mean of the encoder output sequence and passing it through a simple linear network followed by a sigmoid layer, which gives values in the range [0,1]
- binding_affinity_prediction.py - This file uses the Encoded Neighbourhood based fingerprinting scheme and feeds the fingerprints to a RandomForestRegressor model to predict the binding affinity of a protein-ligand complex. The predictions are then compared against standard PDBBind data
- similarity.py - This file studies cosine-similarity patterns using the Encoded Neighbourhood based fingerprinting scheme
- gen_fingerprint.py - This is the main file. Given any PDB ID or PDB file as input, it generates the 4 fingerprints which we have implemented
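The pooling step described for train.py and the comparison done in similarity.py can be sketched in plain Python. This is an illustration only: the actual model is written in PyTorch, and the function names, weights, and toy inputs below are hypothetical. `encoder_outputs` stands in for the sequence of vectors produced by the Transformer encoder.

```python
import math


def encode(encoder_outputs, weights, bias):
    """Mean-pool a sequence of vectors, then apply a linear layer and sigmoid.

    The output length equals len(weights), regardless of sequence length.
    """
    n = len(encoder_outputs)
    pooled = [sum(column) / n for column in zip(*encoder_outputs)]
    encoded = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, pooled)) + b
        encoded.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid, values in (0, 1)
    return encoded


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical linear layer mapping 3 features to a length-2 encoding.
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
b = [0.0, 0.1]

# Two sequences of different lengths yield encodings of the same length.
short_seq = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
long_seq = [[1.0, 1.0, 1.0]] * 4
enc_short = encode(short_seq, W, b)
enc_long = encode(long_seq, W, b)
similarity = cosine_similarity(enc_short, enc_long)
```

Because the mean is taken over the sequence dimension, complexes with different numbers of ligand atoms map to vectors of the same size, which makes them directly comparable.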
To get a local copy up and running follow these simple steps.
- Python
To run the code in this project, Python must be installed on your system.
To set up a local copy of the project, follow one of the two methods listed below. Once the local copy is set up, follow the steps listed in Usage to interact with the system.
- Clone the repo
git clone https://github.com/debajyotidasgupta/Protein-Ligand-Fingerprinting.git
- Alternatively, unzip the attached submission zip file to unpack all the files included with the project.
unzip <submission_file.zip>
- Change directory to the Protein-Ligand-Fingerprinting directory
cd Protein-Ligand-Fingerprinting
- Create a virtual environment to install the required dependencies
virtualenv venv or python3 -m venv venv
- Activate the virtual environment venv
source venv/bin/activate
- Install the required dependencies with the following command
pip install -r requirements.txt
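After installing, a quick check like the following can confirm that the libraries listed earlier are importable in the active environment. This helper is not part of the project; the module list is inferred from the libraries named in this README, and `missing_modules` is a hypothetical name.

```python
from importlib.util import find_spec

# Import names for the libraries listed in this README. Some differ from
# the package name (scikit-learn -> sklearn, Biopython -> Bio,
# PyTorch -> torch).
REQUIRED = ["numpy", "torch", "sklearn", "deepchem", "Bio", "rdkit",
            "pandas", "matplotlib", "seaborn", "tqdm", "scipy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```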
Once the local copy of the project has been set up, follow these steps to generate fingerprints.
To generate fingerprints for a particular PDB ID:
- Open a terminal in the main project directory
- Run the gen_fingerprint.py file with only the PDB ID or PDB filename as argument
python gen_fingerprint.py <pdbid>
Example
python gen_fingerprint.py 2XNI or python gen_fingerprint.py 2XNI.pdb
- The fingerprints obtained using all 4 techniques mentioned earlier are displayed on the screen
The following four outputs are generated and saved in the files listed:
- Neighbourhood based fingerprint - saved in output/<pdb_id>/<pdb_id>_neighbour.txt
- Encoded Neighbourhood based fingerprint - saved in output/<pdb_id>/<pdb_id>_neighbour_transformer.txt
- Kmer based fingerprint - saved in output/<pdb_id>/<pdb_id>_aar_kmer.json
- Ligand MACCS key - saved in output/<pdb_id>/<pdb_id>_maacs.txt
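To locate these files from another script, the paths can be built from the same argument passed to gen_fingerprint.py. The helper below is a hypothetical sketch, not part of the project. It assumes the PDB ID is used exactly as typed (no case conversion) and that a trailing .pdb extension is simply dropped.

```python
from pathlib import Path

# File name suffixes as listed above (spelling kept as in the project).
SUFFIXES = {
    "neighbour": "_neighbour.txt",
    "encoded": "_neighbour_transformer.txt",
    "kmer": "_aar_kmer.json",
    "maccs": "_maacs.txt",
}


def output_paths(arg, root="output"):
    """Map a PDB ID or PDB filename to the four expected output paths."""
    pdb_id = Path(arg).stem  # "2XNI.pdb" -> "2XNI", "2XNI" -> "2XNI"
    folder = Path(root) / pdb_id
    return {kind: folder / f"{pdb_id}{suffix}" for kind, suffix in SUFFIXES.items()}


paths = output_paths("2XNI.pdb")
for kind, path in paths.items():
    print(kind, path.as_posix())
```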
To train and test a RandomForestRegressor that predicts the binding affinity of complexes:
- Open a terminal in the main project directory
- Run the binding_affinity_prediction.py file
python binding_affinity_prediction.py
- The R^2 score of the model's predictions, compared against the PDBBind binding affinities, is displayed on the screen
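For reference, the R^2 score (coefficient of determination) is one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal plain-Python sketch of that standard formula. The function name and the toy numbers are illustrative and are not taken from the project or from PDBBind.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy affinities: a perfect prediction and a constant "predict the mean" baseline.
measured = [1.0, 2.0, 3.0]
perfect = r_squared(measured, [1.0, 2.0, 3.0])
baseline = r_squared(measured, [2.0, 2.0, 2.0])
```

A score of 1 means the predictions match the measured values exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.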
Distributed under the Apache License 2.0. See LICENSE for more information.
Name | Roll No. | Email |
---|---|---|
Debajyoti Dasgupta | 18CS30051 | debajyotidasgupta6@gmail.com |
Somnath Jena | 18CS30047 | somnathjena.2011@gmail.com |
Resources we found helpful and would like to credit:
- CS61060 - Computational Biophysics: Algorithms to Applications
- ProLIF: a library to encode molecular interactions as fingerprints
- Protein-Ligand Interaction Fingerprints
- Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding
- PDBBind dataset - 2019
- Downloading PDB files
- PUG REST API
- RCSB Data API for SMILES
- Transformer is all you need