
Find protein binders using ML

Primary language: Python. License: MIT.

Pathfinder

The goal of this experimental project is to build a deep-learning-based search mechanism that efficiently finds small-molecule binders for a given protein. The main hypothesis is that, with the right embeddings for both small molecules and protein structures, it is possible to train a model that translates them into the same embedding space.

If we could build such a model, finding molecules for a protein would reduce to finding the most similar vectors, which is a scalable and efficient way to screen enormous molecule libraries.
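As a minimal sketch of what "finding the most similar vectors" could look like, the snippet below ranks a handful of molecules by cosine similarity to a protein vector. The vectors, the identifiers (`mol_a` and so on) and the function names are made up for illustration; a real library would need an approximate nearest-neighbor index rather than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_molecules(protein_vec, molecule_vecs, k=2):
    """Return the ids of the k molecules most similar to the protein vector."""
    scored = [
        (cosine_similarity(protein_vec, vec), mol_id)
        for mol_id, vec in molecule_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [mol_id for _, mol_id in scored[:k]]

# Toy 3-dimensional embeddings (hypothetical values, not model output).
molecule_vecs = {
    "mol_a": [0.9, 0.1, 0.0],
    "mol_b": [0.0, 1.0, 0.0],
    "mol_c": [0.7, 0.7, 0.1],
}
# Assumed to be a protein embedding already translated into molecule space.
protein_vec = [1.0, 0.2, 0.0]

hits = top_k_molecules(protein_vec, molecule_vecs, k=2)
print(hits)
```

Cosine similarity is only one possible choice of metric; the right one depends on how the shared embedding space ends up being trained.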

Architecture

The project will consist of three major components.

Protein embedding model

The first step is to train an embedding representation of a protein. There is some prior literature on this, such as dMaSIF. There are some failed experiments in the protein-embeddings directory with graph neural networks and 2D convolutional networks over adjacency matrices.

The current approach is to use GearNet and its pre-trained models, trained on AlphaFold2-predicted structures, as the base embeddings for protein structures. The paper accompanying that code also gives a good explanation of tasks and evaluation strategies for protein structure embeddings.

Useful datasets:

Molecule embedding model

As with proteins, we need to compute embeddings for small molecules.

Unlike protein embeddings, molecule embeddings should not require a conformation (3D structure) of the molecule, as the number of rotatable bonds can quickly make any search problem intractable (billions of molecules × thousands or more conformations each).
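To put rough numbers on that scale, the back-of-the-envelope sketch below compares the raw storage for one vector per molecule against one vector per conformer. Only the library size (about 2B molecules, ZINC15 scale) comes from this document; the conformer count, embedding dimension and float width are assumptions chosen for illustration.

```python
# Back-of-the-envelope index sizes. All values except the library size
# are assumptions, not measurements.
n_molecules = 2_000_000_000   # roughly ZINC15 scale
n_conformers = 1_000          # assumed conformers per molecule
dim = 256                     # assumed embedding dimension
bytes_per_float = 4           # float32

def index_size_tb(n_vectors, dim, bytes_per_float=4):
    """Raw vector storage in terabytes (1 TB = 1e12 bytes), no index overhead."""
    return n_vectors * dim * bytes_per_float / 1e12

per_molecule_tb = index_size_tb(n_molecules, dim, bytes_per_float)
per_conformer_tb = index_size_tb(n_molecules * n_conformers, dim, bytes_per_float)

print(f"one vector per molecule:  {per_molecule_tb:.3f} TB")
print(f"one vector per conformer: {per_conformer_tb:.0f} TB")
```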

Possible approaches to molecule embeddings include language models (based on SMILES or SMARTS representations) and graph-based models.

There are a few available datasets of unlabeled molecules, typically from molecule vendors but also from academia.
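A language model over SMILES first needs the strings split into tokens. The sketch below uses a simplified regular expression that keeps bracket atoms and two-letter halogens together; it is an illustrative assumption, does not cover the full SMILES grammar, and does not validate chemistry (a toolkit such as RDKit would be needed for that).

```python
import re

# Simplified SMILES token pattern (an assumption, not a full grammar):
# bracket atoms, two-letter halogens, two-digit ring closures, single
# letters, digits, then bond/branch/other punctuation.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing on unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

def build_vocab(token_lists):
    """Map every distinct token to an integer id, in sorted order."""
    distinct = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(distinct)}

aspirin = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
halide = tokenize_smiles("ClCCBr")
vocab = build_vocab([aspirin, halide])
aspirin_ids = [vocab[tok] for tok in aspirin]
print(aspirin)
print(aspirin_ids)
```

The resulting integer ids are what a sequence model would consume; a graph-based model would instead start from the atom and bond structure.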

Datasets:

  • ChEMBL - a featurized dataset of 2.4M compounds that includes, in some cases, binding target proteins and their pChEMBL values (a good measure of a molecule's binding affinity)
  • Enamine - around 6M compounds available for purchase from Enamine
  • ZINC15 - a huge dataset of 2B molecules split into smaller datasets
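For orientation on the pChEMBL value mentioned in the list above: it is the negative base-10 logarithm of a molar activity measurement (such as IC50, Ki or Kd), so higher means more potent. The helper below is a small sketch of that conversion from nanomolar input; the function name is hypothetical and not part of any ChEMBL client.

```python
import math

def pchembl_from_nm(value_nm):
    """Convert an activity in nanomolar to a pChEMBL-style value.

    pChEMBL = -log10(activity in mol/L), and 1 nM = 1e-9 mol/L,
    so the result is 9 - log10(value in nM).
    """
    if value_nm <= 0:
        raise ValueError("activity must be positive")
    return 9.0 - math.log10(value_nm)

for nm in (1.0, 10.0, 1000.0):
    print(f"{nm:>7.1f} nM -> pChEMBL {pchembl_from_nm(nm):.2f}")
```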

Embedding translator

This is a new model that we have to train. It would use a smaller dataset of bound molecules to train a neural network that translates protein embeddings into molecule embeddings. The current idea is to use contrastive learning and multimodal learning to train such an embedding translator. This is a new and largely unproven approach.

The recent development of multimodal architectures (text-to-image, image search, etc.) is a good inspiration, especially the field of image search.
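One common contrastive objective is an InfoNCE-style loss, in which each protein in a batch should score its own bound molecule higher than the other molecules in that batch. The sketch below computes only that loss value in one direction (protein to molecule), in plain Python and without any training loop. It assumes the protein vectors have already passed through a hypothetical translator network, and the temperature is an arbitrary choice; nothing here is confirmed as the project's actual method.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(protein_vecs, molecule_vecs, temperature=0.1):
    """Mean cross-entropy of matching protein i to molecule i within a batch.

    Row i of both inputs is assumed to be a known bound pair; every other
    molecule in the batch acts as a negative for protein i.
    """
    proteins = [normalize(v) for v in protein_vecs]
    molecules = [normalize(v) for v in molecule_vecs]
    total = 0.0
    for i, p in enumerate(proteins):
        logits = [dot(p, m) / temperature for m in molecules]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(proteins)

# Toy batch: pairs that line up give a low loss, shuffled pairs a high one.
proteins = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(proteins, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(proteins, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, shuffled_loss)
```

In practice this would be written in an autodiff framework so the translator's weights can be updated from the loss, and a symmetric variant would add the molecule-to-protein direction.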

Datasets to train these models would need a protein pose with a docked molecule. There are few available