Source code for "A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals"



Source code for "A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals", published in Nature Communications. The code was developed on Ubuntu 16.04, and a 2080 Ti GPU was used for training.

Simplified Instruction

If you want to quickly explore our work or do not have much deep-learning experience, you can simply follow the instructions in this section.

  • Step 1: Download the zip or clone the repository to your workspace.
  • Step 2: Download ckpt_ret01.pt from Google Drive. Create a new directory with mkdir save_model and put the downloaded model in the save_model/ directory.
  • Step 3: Install Anaconda (py3) and then create a conda environment by the following command (remember to input 'y' when asked Proceed ([y]/n)? ):
conda create -n KV python=3.6
conda activate KV
sh scripts/conda_environment.sh

Note that installing transformers may fail on macOS. See here for help.

  • Step 4: Check the demo_matching.py file and set if_cuda=False in line 7 if there is no GPU available. Run the command:
python demo_matching.py

Then explore the versatile reading task by following the program's prompts:

# enter the SMILES string and the textual description you want to match, then press Enter
SMILES_string: >> CC(CN)O
description: >> It is an amino alcohol and a secondary alcohol.
# the program will return a score between 0 and 1 (higher is more similar)
Matching_score =  0.8025086522102356

# the program loops automatically until you press Ctrl+C
SMILES_string: >>


  • torch==1.6.0
  • transformers>=3.3.1
  • numpy>=1.19.3
  • sklearn
  • tqdm
  • seqeval
  • chainer_chemistry
  • subword-nmt
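
The version constraints above can be checked programmatically before running anything. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `satisfies` and `check` are hypothetical, and only the three packages with explicit version constraints are covered.

```python
import re

# Constraints copied from the requirement list above (torch is pinned exactly).
REQUIREMENTS = {
    "torch": ("==", "1.6.0"),
    "transformers": (">=", "3.3.1"),
    "numpy": (">=", "1.19.3"),
}

def parse_version(text):
    """Turn '1.19.3' or '1.6.0+cu101' into a tuple of ints such as (1, 19, 3)."""
    numeric = re.match(r"\d+(?:\.\d+)*", text)
    if numeric is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in numeric.group(0).split("."))

def satisfies(installed, operator, required):
    """Check an installed version against one '==' or '>=' constraint."""
    have, want = parse_version(installed), parse_version(required)
    # Pad to equal length so that '1.6' and '1.6.0' compare as equal.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    if operator == "==":
        return have == want
    if operator == ">=":
        return have >= want
    raise ValueError(f"unsupported operator: {operator!r}")

def check(installed_versions):
    """Return {package: bool} for every package listed in REQUIREMENTS."""
    return {
        name: name in installed_versions
        and satisfies(installed_versions[name], op, required)
        for name, (op, required) in REQUIREMENTS.items()
    }
```

The `installed_versions` argument is a plain dictionary of version strings, for example collected from each package's `__version__` attribute inside the KV environment.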

We strongly suggest creating a conda environment for this project. Installation should finish within a few minutes.

conda create -n KV python=3.6
conda activate KV
sh scripts/conda_environment.sh


KV-PLM and other pre-trained models can be downloaded from Google Drive. We recommend downloading the models and putting them in save_model/ before running the code.

If you want to run the code without the pre-trained models above, choose 'Sci-BERT' mode for $MODEL.
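
The fallback to 'Sci-BERT' mode can be decided automatically by checking whether a checkpoint is present. This is only a sketch under assumptions: `choose_model_mode` is a hypothetical helper, and the checkpoint filename must be supplied by the caller because the filenames of the downloadable models are not listed here.

```python
from pathlib import Path

def choose_model_mode(save_dir, checkpoint_name, preferred_mode="KV-PLM"):
    """Return preferred_mode if the checkpoint file exists, else 'Sci-BERT'.

    save_dir and checkpoint_name are supplied by the caller; this helper is
    illustrative and does not exist in the repository.
    """
    checkpoint = Path(save_dir) / checkpoint_name
    if checkpoint.is_file():
        return preferred_mode
    return "Sci-BERT"
```

The returned string is one of the mode names accepted for $MODEL, so it could be passed on to whatever launches the fine-tuning script.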

File Usage

Users will mainly work with the following files:

  • run_chem.py: Fine-tuning code for ChemProt dataset
  • run_molecule.py: Fine-tuning code for MoleculeNet dataset
  • run_ner.py: Fine-tuning code for BC5CDR NER task
  • run_USPTO.py: Fine-tuning code for USPTO-1k few-shot dataset
  • chemprot/
    • preprocess.py: Data pre-processing code for ChemProt
    • train/dev/test.txt: Raw data for ChemProt
  • MoleculeNet/
    • *.txt / *.npy: Pre-processed data for MoleculeNet task
  • NER/
    • preprocess.py: Data pre-processing code for BC5CDR
    • BC5CDR/: Raw data for BC5CDR
  • Ret/
    • align_des_filt.txt: Molecule description text
    • align_smiles.txt: Molecule SMILES text
    • calcu_sent.py: PCdes_choice test code
    • calcu_test.py: Retrieval training evaluation code
    • preprocess.py: Data pre-processing code for versatile reading
  • USPTO/
    • *.txt / *.npy: Pre-processed data for USPTO task
  • scripts/
    • data_preprocess.sh: Data pre-processing bash file
    • finetune.sh: Molecule Structure tasks and Natural Language tasks fine-tuning bash file
    • versatile_reading.sh: Versatile Reading tasks fine-tuning bash file
    • smiles_bpe.sh: Utility script to generate BPE subwords

Data Preprocessing

Switch to the scripts/ directory and run the following command to pre-process the raw data (remember to create the 'Ret/Sci' file before running):

sh data_preprocess.sh

Edit smiles_bpe.sh and run it to apply the BPE tokenizer and obtain the subword results.
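
For readers unfamiliar with BPE, the idea is to repeatedly merge the most frequent pair of adjacent symbols into a new subword. The toy sketch below illustrates this on SMILES-like strings. It is not the subword-nmt implementation and not what smiles_bpe.sh runs; it starts from single characters (so two-letter atoms such as Cl are initially split), and all function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs over all sequences and return the top one."""
    counts = Counter()
    for symbols in sequences:
        counts.update(zip(symbols, symbols[1:]))
    if not counts:
        return None
    # Break ties by the pair itself so the result is deterministic.
    return max(counts, key=lambda pair: (counts[pair], pair))

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(strings, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(s) for s in strings]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(symbols, pair) for symbols in sequences]
    return merges, sequences
```

For instance, with the inputs "CCO", "CCN" and "CCC" and a single merge, the pair ("C", "C") is the most frequent, so the strings become CC|O, CC|N and CC|C.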

Downstream Tasks

We strongly recommend testing our code with a GPU. Each downstream fine-tuning run usually takes no more than an hour.

Currently we support downstream fine-tuning and validation on rxnfp, Sci-BERT, KV-PLM and KV-PLM*.

For Molecule Structure Tasks and Natural Language Tasks, go to the scripts/ directory and modify the finetune.sh script according to:

# 'bc5cdr' for NER, 'chemprot' for RE, 'uspto' for chemical reaction classification and 'moleculenet' for molecule property classification.
# for MoleculeNet, we support 4 sub-tasks: 'BBBP', 'sider', 'HIV' and 'tox21'.
# could be 'Sci-BERT', 'KV-PLM', 'KV-PLM*', 'BERT' or 'SMI-BERT'.

Then run the script.
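
Because a typo in these settings only surfaces once the script is running, it can help to validate them first. The sketch below checks a combination against the values named in the comments above. `validate_settings` and its parameter names are hypothetical and not part of finetune.sh.

```python
# Allowed values, copied from the comments in the instructions above.
DATASETS = {"bc5cdr", "chemprot", "uspto", "moleculenet"}
MOLECULENET_SUBTASKS = {"BBBP", "sider", "HIV", "tox21"}
MODELS = {"Sci-BERT", "KV-PLM", "KV-PLM*", "BERT", "SMI-BERT"}

def validate_settings(dataset, model, subtask=None):
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if dataset not in DATASETS:
        problems.append(f"unknown dataset {dataset!r}")
    if model not in MODELS:
        problems.append(f"unknown model {model!r}")
    # Only MoleculeNet is described as having sub-tasks.
    if dataset == "moleculenet" and subtask not in MOLECULENET_SUBTASKS:
        problems.append(
            f"moleculenet needs a subtask from {sorted(MOLECULENET_SUBTASKS)}"
        )
    return problems
```

Calling `validate_settings("moleculenet", "Sci-BERT")` reports one problem, the missing sub-task, while `validate_settings("chemprot", "KV-PLM")` returns an empty list.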

For Versatile Reading Tasks, versatile_reading.sh provides the training process. Modify the script according to:

# same as above

Run the script to obtain the fine-tuned model and the encoding results for the test sets in the ../Ret/output_sent/ directory. Then go to ../Ret/ and run calcu_test.py and calcu_sent.py for evaluation.
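
To give an idea of what a retrieval evaluation over such encodings can look like, here is a minimal sketch of top-1 retrieval accuracy. It is an illustration only: whether calcu_test.py uses cosine similarity or this exact metric is an assumption, and the function names are hypothetical. It assumes query i and candidate i form the matching pair.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if one is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top1_accuracy(query_vectors, candidate_vectors):
    """Fraction of queries whose best-scoring candidate shares their index.

    Assumes query i and candidate i are the ground-truth pair.
    """
    if not query_vectors:
        return 0.0
    hits = 0
    for i, query in enumerate(query_vectors):
        scores = [cosine(query, candidate) for candidate in candidate_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += best == i
    return hits / len(query_vectors)
```

Here the queries could be molecule encodings and the candidates description encodings, or the other way round.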

The Ret/data/ directory contains PCdes_for_human.txt, the PCdes test examples that we provided to the human professional annotators.


We provide a simple demo for exploring versatile reading. Download ckpt_ret01.pt and put it in the save_model/ directory. Run python demo_matching.py and enter your SMILES string and description sentence when prompted. Set if_cuda=False if there is no GPU available. Loading the model takes around 30 s.

Some examples:

- description: It is an amino alcohol and a secondary alcohol.
- matching score: 0.8025(True)

- description: A hydroxy acid with anti-inflammatory effect.
- matching score: 0.4086(False)

- description: flammable liquid with a pleasant smell.
- matching score: 0.5849(True)

- description: a clear colorless liquid with a pungent odor.
- matching score: 0.2279(False)

- description: A hydroxy acid with anti-inflammatory effect. It has a role as metabolite.
- matching score: 0.4795(True)

- description: appears as pale yellow needles, almond odor.
- matching score: 0.4499(True)

You can also test the matching score between two SMILES strings:

- description: C1=CC=C(C(=C1)C(=O)O)O
- matching score: 0.7464(True)

- SMILES: C1=CC=C(C(=C1)C(=O)O)O
- description: C1(C(C(=O)OC1C(C(=O)O)O)O)O
- matching score: 0.1287(False)
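
The scores listed above can be checked for a single cut-off that separates matching from non-matching pairs. The sketch below assumes that the (True)/(False) tags are ground-truth labels, which is an interpretation rather than something stated here. `separating_threshold` is a hypothetical helper and not part of demo_matching.py.

```python
# (score, label) pairs copied from the examples above.
EXAMPLES = [
    (0.8025, True), (0.4086, False), (0.5849, True), (0.2279, False),
    (0.4795, True), (0.4499, True), (0.7464, True), (0.1287, False),
]

def separating_threshold(examples):
    """Return the midpoint between the lowest True and highest False score,
    or None if the two groups overlap."""
    lowest_true = min(score for score, label in examples if label)
    highest_false = max(score for score, label in examples if not label)
    if highest_false >= lowest_true:
        return None
    return (lowest_true + highest_false) / 2
```

On these eight examples the two groups do not overlap, but the gap is narrow (0.4086 versus 0.4499), so a fixed cut-off taken from so few pairs should be treated with caution.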


Please cite our paper if you find it helpful.

    title={A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals},
    author={Zheni Zeng, Yuan Yao, Zhiyuan Liu, Maosong Sun},
    journal={Nature Communications},
    publisher={Nature Publishing Group}