GitHub repository for LDMol, a latent text-to-molecule diffusion model.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

LDMol

Official GitHub repository for LDMol, a latent text-to-molecule diffusion model. The details can be found in the following paper:

LDMol: Text-Conditioned Molecule Diffusion Model Leveraging Chemically Informative Latent Space. (arxiv 2024)

[Figure: ldmol_fig2]

[Figure: ldmol_fig3]

LDMol not only generates molecules from a given text prompt but also performs downstream tasks, including molecule-to-text retrieval and text-guided molecule editing.

The model checkpoint and data are too large to include in this repo; they can be found here.

Requirements

Run conda env create -f requirements.yaml to create a conda environment named ldmol.

Inference

See the arguments defined in each script file for more details.

1. text-to-molecule generation

  • zero-shot: The model gets a hand-written text prompt.
    CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 inference_demo.py --num-samples 100 --ckpt ./Pretrain/checkpoint_ldmol.pt --prompt="This molecule includes benzoyl group." --cfg-scale=5
    
  • benchmark dataset: The model performs text-to-molecule generation on the ChEBI-20 test set. The evaluation metrics are printed at the end.
    TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 inference_t2m.py --ckpt ./Pretrain/checkpoint_ldmol_chebi20.pt --cfg-scale=3.5
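
The --cfg-scale argument in the commands above presumably sets the classifier-free guidance strength, as in DiT. How the inference scripts apply it is not shown in this README, so the following is only a minimal sketch of the standard formulation; apply_cfg is a hypothetical helper, not a function from this repo.

```python
def apply_cfg(eps_uncond, eps_cond, cfg_scale):
    """Standard classifier-free guidance (assumed, not taken from the repo):
    start from the unconditional noise prediction and move along the
    direction of the text-conditioned one, scaled by cfg_scale."""
    return [u + cfg_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a 3-dimensional latent (made-up numbers).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

# A scale of 1 reproduces the conditional prediction.
print(apply_cfg(eps_uncond, eps_cond, 1.0))  # [1.0, 1.0, 0.0]
# A larger scale pushes further in the direction the text prompt suggests.
print(apply_cfg(eps_uncond, eps_cond, 5.0))  # [5.0, 1.0, 4.0]
```

Under this reading, larger values such as --cfg-scale=5 follow the prompt more strongly, usually at some cost in sample diversity.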
    

2. molecule-to-text retrieval

The model performs molecule-to-text retrieval on the given dataset. --level sets the level of the query text (paragraph or sentence). --n-iter is the number of function evaluations of the model.

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 inference_retrieval_m2t.py --ckpt ./Pretrain/checkpoint_ldmol.pt --dataset="./data/PCdes/test.txt" --level="paragraph" --n-iter=10
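
This README does not spell out the retrieval scoring rule. One common way to use a conditional diffusion model for retrieval is to noise the molecule latent, ask the model to predict that noise under each candidate text, and pick the text with the lowest average error. The sketch below illustrates that idea under this assumption, reading --n-iter as the number of noise draws per candidate. toy_denoiser and retrieval_scores are hypothetical stand-ins, and the latents are made-up vectors rather than real encoder outputs.

```python
import math
import random


def toy_denoiser(z_t, alpha_bar, cond):
    # Hypothetical stand-in for the text-conditioned noise predictor:
    # it assumes the clean latent equals `cond` and solves for the noise.
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - a * c) / s for zt, c in zip(z_t, cond)]


def retrieval_scores(z0, candidates, n_iter, seed=0):
    """Average squared noise-prediction error of each candidate condition."""
    rng = random.Random(seed)
    totals = [0.0] * len(candidates)
    for _ in range(n_iter):
        # Draw a noise level and a noise sample, then noise the latent.
        alpha_bar = rng.uniform(0.1, 0.9)
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        noise = [rng.gauss(0.0, 1.0) for _ in z0]
        z_t = [a * z + s * n for z, n in zip(z0, noise)]
        # One model evaluation per candidate for this draw.
        for i, cond in enumerate(candidates):
            pred = toy_denoiser(z_t, alpha_bar, cond)
            totals[i] += sum((p - n) ** 2 for p, n in zip(pred, noise))
    return [t / n_iter for t in totals]


z0 = [0.5, -1.0, 2.0]  # toy molecule latent
candidates = [[0.0, 0.0, 0.0], [0.5, -1.0, 2.0], [2.0, 1.0, -1.0]]
scores = retrieval_scores(z0, candidates, n_iter=10)
best = min(range(len(scores)), key=scores.__getitem__)
print(best)  # index of the candidate with the lowest error
```

In this toy setup the second candidate matches the latent exactly, so its error is essentially zero and it is retrieved. More draws give a less noisy score at a proportionally higher compute cost.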

3. text-guided molecule editing

The model performs DDS-style text-guided molecule editing. --source-text should describe the molecule given by --input-smiles. --target-text is the description of the desired molecule.

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 inference_dds.py --ckpt ./Pretrain/checkpoint_ldmol.pt --input-smiles="C[C@H](CCc1ccccc1)Nc1ccc(C#N)cc1F" --source-text="This molecule contains fluorine." --target-text="This molecule contains bromine."
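
Assuming DDS here refers to Delta Denoising Score, the editing loop can be pictured as follows: the edited latent and the source latent are noised with the same noise, and the edited latent is moved along the difference between the target-conditioned and source-conditioned noise predictions. The sketch below is a toy illustration of that update only. toy_denoiser and dds_edit are hypothetical, the vectors stand in for encoded SMILES and text, and nothing here is taken from inference_dds.py.

```python
import math
import random


def toy_denoiser(z_t, alpha_bar, cond):
    # Hypothetical stand-in for the text-conditioned noise predictor:
    # it assumes the clean latent equals `cond` and solves for the noise.
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - a * c) / s for zt, c in zip(z_t, cond)]


def dds_edit(z_src, c_src, c_tgt, steps=300, lr=0.1, seed=0):
    """DDS-style update: step along eps(z_t, target) - eps(z_src_t, source)."""
    rng = random.Random(seed)
    z = list(z_src)  # the edited latent starts at the source latent
    for _ in range(steps):
        alpha_bar = rng.uniform(0.1, 0.9)
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        # The same noise is shared by both branches.
        noise = [rng.gauss(0.0, 1.0) for _ in z]
        z_t = [a * zi + s * n for zi, n in zip(z, noise)]
        z_src_t = [a * zi + s * n for zi, n in zip(z_src, noise)]
        eps_tgt = toy_denoiser(z_t, alpha_bar, c_tgt)
        eps_src = toy_denoiser(z_src_t, alpha_bar, c_src)
        z = [zi - lr * (et - es) for zi, et, es in zip(z, eps_tgt, eps_src)]
    return z


z_src = [0.2, -0.4, 1.0]  # toy latent of the input molecule
c_src = [1.0, 0.0, 0.0]   # toy embedding of the source text
c_tgt = [0.0, 1.0, 0.0]   # toy embedding of the target text
z_edit = dds_edit(z_src, c_src, c_tgt)
print([round(v, 3) for v in z_edit])  # close to [-0.8, 0.6, 1.0]
```

Because the source branch is subtracted, components that the two descriptions agree on stay untouched, which is why --source-text should describe the input molecule accurately.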

Acknowledgement

  • The code for the DiT diffusion model is based on and modified from the official DiT code.
  • The code for BERT with cross-attention layers (xbert.py) and the schedulers is modified from ALBEF.