The official GitHub repository for SPMM, a multimodal molecular pre-trained model for a synergistic comprehension of molecular structure and properties. The details can be found in the following paper: "Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model" (Nature Communications 2024).
Molecular structures are given as SMILES strings, and we use 53 simple chemical properties to build a property vector (PV) for each molecule.
The model checkpoint and data are too heavy to be included in this repo, and they can be found here.
- data/: Contains the data used for the experiments in the paper. (You have to create this folder and put the data downloaded from the link above into it.)
- Pretrain/: Contains the checkpoint of the pre-trained SPMM. (You have to create this folder and put the checkpoint downloaded from the link above into it.)
- vocab_bpe_300.txt: Contains the SMILES tokens for the SMILES tokenizer.
- property_name.txt: Contains the names of the 53 chemical properties.
- normalize.pkl: Contains the mean and standard deviation of the 53 chemical properties that we used for the PV.
- calc_property.py: Contains the code to calculate the 53 chemical properties and build a PV for a given SMILES. Modify this code to use SPMM pre-training with your own custom PVs.
- SPMM_models.py: Contains the code for the SPMM model and its pre-training.
- SPMM_pretrain.py: Runs SPMM pre-training.
- d_*.py: Code for the downstream tasks.
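To make the role of the stored mean and standard deviation concrete, the following is a minimal sketch of per-property z-score normalization and its inverse. It is an illustration under assumptions, not the repository's code: the internal layout of normalize.pkl is not described here, so the sketch computes statistics from a toy list of 3-property vectors, and the helper names `fit_stats`, `normalize`, and `denormalize` are hypothetical.

```python
from statistics import mean, pstdev


def fit_stats(rows):
    """Column-wise mean and population standard deviation of a list of PVs."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def normalize(pv, means, stds):
    """Map raw property values to zero-mean, unit-variance values."""
    return [(x - m) / s for x, m, s in zip(pv, means, stds)]


def denormalize(z, means, stds):
    """Map normalized values back to the original property scale."""
    return [v * s + m for v, m, s in zip(z, means, stds)]


# Toy data: three molecules, three made-up properties (assumption, not real PVs).
toy_pvs = [
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 0.0],
    [3.0, 30.0, 0.0],
]
means, stds = fit_stats(toy_pvs)
z = normalize(toy_pvs[0], means, stds)
restored = denormalize(z, means, stds)
```

Normalizing each property separately keeps properties with large numeric ranges from dominating the others; a predicted PV would be mapped back with the inverse transform before being read as chemical values.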
Run pip install -r requirements.txt to install the required packages.
Arguments can be passed with commands, or be edited manually in the running code.
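As an illustration of how command-line values can override in-code defaults, here is a minimal argparse sketch. The flag names are taken from the commands in this README, but the default values and the `str2bool` helper are assumptions; the actual scripts may parse their arguments differently.

```python
import argparse


def str2bool(text):
    """Parse 'True'/'False'-style strings.

    Plain type=bool would treat any non-empty string, including 'False', as True.
    """
    lowered = text.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def build_parser():
    # Defaults below are illustrative assumptions; editing them corresponds to
    # "editing manually in the running code".
    parser = argparse.ArgumentParser(description="illustrative argument handling")
    parser.add_argument("--checkpoint", default="./Pretrain/checkpoint_SPMM.ckpt")
    parser.add_argument("--n_generate", type=int, default=1000)
    parser.add_argument("--stochastic", type=str2bool, default=True)
    parser.add_argument("--k", type=int, default=2)
    return parser


# Values given on the command line replace the defaults; the rest are kept.
args = build_parser().parse_args(["--k", "3", "--stochastic", "False"])
```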
- Pre-training
  python SPMM_pretrain.py --data_path './data/pretrain.txt'
- PV-to-SMILES generation
  - batched: The model takes PVs from the molecules in input_file and generates molecules with those PVs using k-beam search. The generated molecules are written to generated_molecules.txt.
    python d_pv2smiles_batched.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt' --k 2
  - single: The model takes one query PV and generates n_generate molecules with that PV using k-beam search. The generated molecules are written to generated_molecules.txt. Here, you need to build your input PV in the code; see the four examples that we included.
    python d_pv2smiles_single.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --n_generate 1000 --stochastic True --k 2
- SMILES-to-PV generation
  The model takes the query molecules in input_file and generates their PVs.
  python d_smiles2pv.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt'
- MoleculeNet + DILI prediction task
  d_regression.py, d_classification.py, and d_classification_multilabel.py perform regression, binary classification, and multi-label classification tasks, respectively.
  python d_regression.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bace'
  python d_classification.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bbbp'
  python d_classification_multilabel.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'clintox'
- Forward/retro-reaction prediction tasks
  d_rxn_prediction.py performs both forward and retro reaction prediction on the USPTO-480k and USPTO-50k datasets.
  e.g. forward reaction prediction, no beam search
  python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'forward' --n_beam 1
  e.g. retro reaction prediction, beam search with k=3
  python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'retro' --n_beam 3
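Both the PV-to-SMILES scripts (--k) and the reaction prediction script (--n_beam) above rely on beam search. The following is a generic, minimal sketch of k-beam search over a toy next-token table, meant only to show what the beam width controls. It is not the repository's implementation: the table, the token set, and the function names are made up for illustration, and the --stochastic option is not modeled.

```python
import math

# Toy next-token probabilities keyed by the last token (assumption, not a real model).
TOY_TABLE = {
    "<s>": {"C": 0.6, "N": 0.4},
    "C": {"O": 0.3, "</s>": 0.7},
    "N": {"C": 0.9, "</s>": 0.1},
    "O": {"</s>": 1.0},
}


def toy_next_log_probs(tokens):
    """Return log-probabilities of each possible next token."""
    return {tok: math.log(p) for tok, p in TOY_TABLE[tokens[-1]].items()}


def beam_search(next_log_probs, k, max_len, start="<s>", end="</s>"):
    """Keep the k highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:
                # Finished sequences are carried over unchanged.
                candidates.append((tokens, score))
                continue
            for tok, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # Prune to the k best candidates by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams


results = beam_search(toy_next_log_probs, k=2, max_len=4)
greedy = beam_search(toy_next_log_probs, k=1, max_len=4)
```

With k=1 the search reduces to greedy decoding, which matches the "no beam search" example for --n_beam 1; larger k explores more alternatives at a higher computational cost.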
- The code for BERT with cross-attention layers (xbert.py) and the schedulers are modified from those in ALBEF.
- The code for SMILES augmentation is taken from pysmilesutils.