This repo offers several ways to retrieve molecules similar to a target molecule from a large molecule library.
- Clean up the hard-coding and unnecessary files.
- Support SMILES processing for the corpus.
- Benchmark different retrieval methods (performance, efficiency).
- Support more promising retrieval methods.

A retrieval process consists of three parts:
- Build molecule corpus. We need a large candidate molecule library from which to retrieve molecules. Here we provide the example of using eMolecules, which consists of more than 200M commercially available molecules. You can download the latest version of `version.smi.gz` for all the SMILES.
- Choose an embedding type. To accelerate the retrieval process, we first convert each molecule to an embedding. We provide various choices, including SMILES-based fingerprints (`MACCSkeys`, `RDKFingerprint`, `EStateFingerprint`), molecule language models (`ChemBERTa`, `MolT5`, `BioT5`), and graph-based molecule representations (`Grover`, `AttrMask`, `GPT-GNN`, `GraphCL`, `GraphMVP`, `MolCLR`) learned via self-supervised learning.
- Choose a distance function. Once we have the embeddings of the corpus and of the target molecule, we need a distance function to measure the similarity of two embeddings. The top-k most similar molecules are then retrieved from the corpus by searching (we use `heapq.nsmallest`). We provide many distance choices, such as `tanimoto`, `dice`, `cosine`, `euclidean`, `sokal`, `russel`, `kulczynski`, and `McConnaughey`. Note that most of these distances are designed for fingerprint-based embeddings. If you are using neural representations, we suggest trying only the `cosine` and `euclidean` distances.

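The three parts above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repo's implementation: the function names (`tanimoto_distance`, `cosine_distance`, `retrieve_topk`) and the toy fingerprints (sets of made-up on-bit indices) are assumptions for the example. Only the use of `heapq.nsmallest` for the top-k search comes from the description above.

```python
import heapq
import math


def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def cosine_distance(u, v):
    """Cosine distance between two dense vectors (e.g. neural embeddings)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 1.0  # no direction to compare, so report maximal dissimilarity
    return 1.0 - dot / norm


def retrieve_topk(query, corpus, distance, k):
    """Return the k corpus entries (smiles, embedding) closest to the query."""
    return heapq.nsmallest(k, corpus, key=lambda item: distance(query, item[1]))


# Toy corpus: the bit indices are invented, not real MACCS or RDKit fingerprints.
corpus = [
    ("CCO", {1, 2, 3}),
    ("CCN", {1, 2, 4}),
    ("c1ccccc1", {7, 8, 9}),
    ("CCOC", {1, 2, 3, 5}),
]
query = {1, 2, 3}

top2 = [smiles for smiles, _ in retrieve_topk(query, corpus, tanimoto_distance, 2)]
print(top2)  # ['CCO', 'CCOC']
```

Passing the distance function as an argument keeps the search loop identical for fingerprint-based and neural embeddings; only the embedding format and the distance change.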
- Packages install. The packages used are as follows (different versions may also work).

      tqdm==4.65.0
      torch==2.0.0
      torchvision==0.15.0
      rdkit==2023.9.4
      transformers==4.37.1
      selfies==2.1.1
      networkx==3.1
      torch_geometric==2.4.0
      torch-cluster==1.6.1+pt20cu117
      torch-scatter==2.1.1+pt20cu117
      ogb==1.3.5

  Notice: `ogb==1.3.5` is for SSL-based models like GraphMVP. Here's an issue about the version of ogb.
- Preprocessing corpus. Download your molecule corpus from eMolecules in SMILES form. Process it so that each line has the form "SMILES size" (such as `COO 3` or `C1CCCCC1 6`), one molecule per line, and save it to the `SMILES_LIB_PATH` assigned in `utils/env_utils.py`. We use the code snippet below to load this corpus:

      with open(SMILES_LIB_PATH, 'r') as f:
          smiles = [x.strip().split()[0] for x in f.readlines()]

  The molecule sizes are used for the pruning strategy. When retrieving from the corpus, we may sort the corpus by size first and search only the molecules of similar size. Pruning is optional.
- Prepare model checkpoints. To run `ChemBERTa`, `BioT5`, `MolT5`, and the SSL-based graph models, you need to download the corresponding checkpoints. Once they are downloaded, make sure the paths in `utils/env_utils.py` are correct.
  - Download molecule language models. We suggest using `huggingface-cli` (check this guide; you need to install the latest `huggingface_hub` for downloading) to download the models. Use the following command (here is an example command for ChemBERTa):

        MODEL_DIR="ChemBERTa-77M-MTR"
        HF_PATH="DeepChem/ChemBERTa-77M-MTR"  # for molt5 and biot5, you can choose "laituan245/molt5-base" and "QizhiPei/biot5-base".
        mkdir $MODEL_DIR
        cd $MODEL_DIR
        huggingface-cli download $HF_PATH --local-dir ./

  - Download SSL-based models. These models can be downloaded from the repo for GraphMVP; they can be found in the Google Drive. You can download the corresponding checkpoints for `Grover` (`Motif.pth`), `AttrMask` (`AM.pth`), `GPT-GNN` (`GPT_TNN.pth`), and `GraphCL` (`GraphCL.pth`). For `GraphMVP`, download `GraphMVP_complate_features_for_regression.zip` from here, where the model is `GraphMVP_complate_features_for_regression/GraphMVP/pretraining_model.pth`. For `MolCLR`, you can download it (`model.pth`) from this repo.
- Build embedding libs. We provide code for building the embedding library in `build_lib.py`, `retrieve/models/SSL/utils.py`, and `retrieve/models/MolCLR/utils.py` (they will be merged in the future). `build_lib.py` supports fingerprint and molecule language model embeddings. Building the library may be time-consuming (possibly several hours). You can adjust `blocksize` and `chunksize` according to your hardware (`blocksize` is like `batch_size` in ML training, and the embeddings of each chunk are saved in one file). If you do not plan to retrieve with a specific embedding type, you can skip building the library for it.
- Begin Retrieval! We use a `config.yaml` for argument parsing (see `retrieve/configs` for examples). A config file typically consists of arguments such as `distance_type`, `embedding_type`, `prunning`, `query_path`, `save_path`, and `topk`.
  - The supported distance types are:

        distanceType = (
            'Tanimoto', 'Dice', 'Cosine', 'Euclidean', 'Sokal',
            'Russel', 'Kulczynski', 'McConnaughey', 'random'
        )

  - The supported embedding types are:

        embeddingTypes = (
            'RDKFingerprint', 'MACCSkeys', 'EStateFingerprint',
            'ChemBERTa', 'MolT5', 'BioT5',
            'AttrMask', 'GPT-GNN', 'GraphCL', 'MolCLR', 'GraphMVP', 'GROVER',
            'random'
        )

  - `query_path` is a txt file containing all the target molecules for which we want to retrieve similar molecules, one SMILES per line.
  - `save_path` is the directory in which to save the retrieved results.
  - `top_k` is the expected number of retrieved molecules.

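To make the config arguments and the optional size pruning more concrete, here is a rough sketch in plain Python. It is not the repo's code: the helper names (`validate_config`, `parse_corpus`, `prune_by_size`), the `tolerance` parameter, the boolean value for `prunning`, and the example paths are all assumptions made for illustration. The supported-type tuples and the "SMILES size" line format are taken from the steps above.

```python
import bisect

# Supported values, copied from the lists above.
DISTANCE_TYPES = (
    'Tanimoto', 'Dice', 'Cosine', 'Euclidean', 'Sokal',
    'Russel', 'Kulczynski', 'McConnaughey', 'random',
)
EMBEDDING_TYPES = (
    'RDKFingerprint', 'MACCSkeys', 'EStateFingerprint',
    'ChemBERTa', 'MolT5', 'BioT5',
    'AttrMask', 'GPT-GNN', 'GraphCL', 'MolCLR', 'GraphMVP', 'GROVER',
    'random',
)


def validate_config(cfg):
    """Raise ValueError if the config names an unsupported option (hypothetical helper)."""
    if cfg["distance_type"] not in DISTANCE_TYPES:
        raise ValueError(f"unsupported distance_type: {cfg['distance_type']!r}")
    if cfg["embedding_type"] not in EMBEDDING_TYPES:
        raise ValueError(f"unsupported embedding_type: {cfg['embedding_type']!r}")
    if int(cfg["topk"]) <= 0:
        raise ValueError("topk must be a positive integer")
    return cfg


def parse_corpus(lines):
    """Turn 'SMILES size' lines into a list of (size, smiles) sorted by size."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        entries.append((int(parts[1]), parts[0]))
    entries.sort()
    return entries


def prune_by_size(entries, query_size, tolerance):
    """Keep only molecules whose size is within +/- tolerance of the query size."""
    sizes = [size for size, _ in entries]
    lo = bisect.bisect_left(sizes, query_size - tolerance)
    hi = bisect.bisect_right(sizes, query_size + tolerance)
    return [smiles for _, smiles in entries[lo:hi]]


# Example config as a dict; in practice it would be loaded from a YAML file.
cfg = validate_config({
    "distance_type": "Tanimoto",
    "embedding_type": "MACCSkeys",
    "prunning": True,             # assumed to be a boolean switch
    "query_path": "queries.txt",  # hypothetical path
    "save_path": "results/",      # hypothetical path
    "topk": 10,
})

entries = parse_corpus(["COO 3", "C1CCCCC1 6", "CCO 3", "CCCCCCCCCC 10"])
candidates = prune_by_size(entries, query_size=5, tolerance=2)
print(candidates)  # ['CCO', 'COO', 'C1CCCCC1']
```

Because the corpus is sorted by size once, each query only needs two binary searches to find its candidate window before any distance is computed.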
This repo is currently a very initial version. If you have any questions or would like to contribute, feel free to email Haowei, open an issue, or make a pull request. Contributions that make this repo more helpful are welcome!