MolecularGPT

Official code for "MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction".

📌 News

[2024.6.18] We propose MolecularGPT, a language model that sets new benchmarks for few-shot molecular property prediction tasks.

🚀 Quick Start

Installation

The required packages can be installed by running

conda create -n MolecularGPT python==3.10
conda activate MolecularGPT
cd ./MolecularGPT
bash init_env.sh
pip install git+https://github.com/ashvardanian/usearch-molecules.git@main
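As a quick sanity check of the environment (assuming init_env.sh installs PyTorch and Hugging Face Transformers, which the training and evaluation scripts below rely on), you can run:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"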

Download the Datasets

Train datasets

ChEMBL dataset
cd prompt_data/ 
wget http://bioinf.jku.at/research/lsc/chembl20/dataPythonReduced.zip 
unzip dataPythonReduced.zip 

cd dataPythonReduced 
wget http://bioinf.jku.at/research/lsc/chembl20/dataPythonReduced/chembl20Smiles.pckl 
wget http://bioinf.jku.at/research/lsc/chembl20/dataPythonReduced/chembl20LSTM.pckl 

cd .. 
rm dataPythonReduced.zip 
mkdir -p chembl_raw/raw 
mv dataPythonReduced/* chembl_raw/raw 
wget 'https://www.dropbox.com/s/vi084g0ol06wzkt/mol_cluster.csv?dl=1' 
mv 'mol_cluster.csv?dl=1' chembl_raw/raw/mol_cluster.csv

python transform.py --input-dir chembl_raw/raw --output-dir chembl_full > transform.out 
cd .. 
ChEMBL property dataset
cd prompt_data
filename='mole_graph_property.csv'
fileid='1oLxIDOzp8MY0Jhzc1m6E7SCOVAZO5L4D'
wget --load-cookies /tmp/cookies.txt "https://drive.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://drive.google.com/uc?export=download&id=${fileid}" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=${fileid}" -O ${filename} && rm -rf /tmp/cookies.txt
cd ..
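If the cookie-based wget command above fails (Google Drive occasionally changes its confirmation flow), the same file can be fetched with the gdown package. This is an optional alternative, not part of the official pipeline:

pip install gdown
gdown "https://drive.google.com/uc?id=1oLxIDOzp8MY0Jhzc1m6E7SCOVAZO5L4D" -O prompt_data/mole_graph_property.csv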
QM9 dataset

Download from https://figshare.com/articles/dataset/Data_for_6095_constitutional_isomers_of_C7H10O2/1057646?backTo=/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904
Then uncompress dsgdb9nsd.xyz.tar.bz2 and save the processed data as ./prompt_data/qm9.csv
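A minimal extraction sketch, assuming the archive has been downloaded into prompt_data/; the qm9_xyz directory name is only illustrative, and converting the extracted .xyz files into the qm9.csv expected above is not shown here:

cd prompt_data
mkdir -p qm9_xyz
tar -xjf dsgdb9nsd.xyz.tar.bz2 -C qm9_xyz
cd ..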

Test datasets

MoleculeNet Datasets
mkdir -p property_data
wget http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip
unzip chem_dataset.zip
mv dataset property_data
CYP450 Dataset

Download CYP450.csv.gz from https://github.com/shenwanxiang/ChemBench/blob/master/src/chembench/data_and_index/CYP450/CYP450.csv.gz
Then uncompress it to ./property_data/cyp450/raw/CYP450.csv.
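One possible way to do this from the command line (the download URL below is the standard GitHub raw counterpart of the blob link above):

mkdir -p property_data/cyp450/raw
wget https://github.com/shenwanxiang/ChemBench/raw/master/src/chembench/data_and_index/CYP450/CYP450.csv.gz
gunzip -c CYP450.csv.gz > property_data/cyp450/raw/CYP450.csv
rm CYP450.csv.gz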

Construct the K-Shot instruction datasets

Train datasets

mkdir -p train_process
mkdir -p train_dataset
cd prompts
python generate_pretrain_dataset.py --generate_assay_text --generate_mole_text --generate_qm9_text --split_non_overlap --add_negation
cd ..
python prep_encode_train.py
python prep_index_train.py
python ICL_train.py

# 0,4-shot instruction dataset
mkdir -p train_dataset/0-4-shot
python prep_0_4_shot.py
# 0,1,2,3,4-shot instruction dataset
mkdir -p train_dataset/01234-shot
python prep_01234.py

Test datasets

mkdir -p test_process
mkdir -p test_dataset
python prep_test_dataset_aug.py --prompt_augmentation ''  # choices=['','rewrite','expand','detail','shorten','name']
python prep_encode_test.py
python prep_index_test.py

For classification and regression tasks:

python ICL_test_sim_cls.py
python ICL_test_sim_reg.py

To construct the k-shot instructions arranged in ascending order:

python ICL_test_reverse_cls.py
python ICL_test_reverse_reg.py

To construct the k-shot instructions retrieved based on diversity:

python ICL_test_diversity.py

Train the model

Download LLaMA-2-7b-chat from HuggingFace🤗

mkdir -p ckpts/llama

Download the model from https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and move it to ./ckpts/llama
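One command-line way to fetch the checkpoint, assuming you have accepted Meta's license for this gated repository and hold a Hugging Face access token (a sketch, not part of the official pipeline):

pip install -U "huggingface_hub[cli]"
huggingface-cli login    # paste a token with access to the gated meta-llama repo
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./ckpts/llama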

Train MolecularGPT

sbatch finetune_moleculargpt.sh
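sbatch assumes a Slurm cluster. If Slurm is unavailable, and assuming the script does not rely on Slurm-specific environment variables (check the script before doing this), it can be launched directly:

bash finetune_moleculargpt.sh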

Evaluate the model

Download LoRA Weights from HuggingFace🤗

mkdir -p ckpts/lora

Download the adapter_config.json and adapter_model.bin from https://huggingface.co/YuyanLiu/MolecularGPT and move them to ./ckpts/lora
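Alternatively, the two adapter files can be fetched with the same Hugging Face CLI as above (a sketch):

huggingface-cli download YuyanLiu/MolecularGPT adapter_config.json adapter_model.bin --local-dir ./ckpts/lora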

Evaluate the performance on classification tasks

mkdir -p cache
python downstream_test_llama_cla.py --load_8bit --base_model $model --lora_weights $lora --path $path --shot $shot
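A concrete invocation sketch with the checkpoints placed as above; the --path value is an illustrative placeholder and should point at the k-shot test instructions produced by the preprocessing steps, and the regression script in the next subsection takes the same arguments:

model=./ckpts/llama
lora=./ckpts/lora
path=./test_dataset/<your-test-instruction-file>   # illustrative placeholder
shot=4
python downstream_test_llama_cla.py --load_8bit --base_model $model --lora_weights $lora --path $path --shot $shot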

Evaluate the performance on regression tasks

python downstream_test_llama_reg.py --load_8bit --base_model $model --lora_weights $lora --path $path --shot $shot

Evaluate the baselines

To evaluate baselines such as GIMLET, MoMu, KVPLM, and Galactica, you can refer to GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning.

Reference Code

1. GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning
2. Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
3. Chinese-LLaMA-Alpaca-2
4. stanford_alpaca
5. usearch-molecules

Citation

If you find this repo useful, please star the repo and cite:

@article{liu2024moleculargpt,
    title={MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction},
    author={Yuyan Liu and Sirui Ding and Sheng Zhou and Wenqi Fan and Qiaoyu Tan},
    year={2024},
    eprint={2406.12950},
    archivePrefix={arXiv}
}