Perplexity-based molecule ranking and bias estimation of chemical language models

Table of Contents

  1. Description
  2. Requirements
  3. How to run an experiment
  4. How to cite this work
  5. License
  6. Address

Description

This is the supporting code for the paper «Perplexity-based molecule ranking and bias estimation of chemical language models».

Abstract of the paper: Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as the simplified molecular input line entry systems (SMILES) strings, in a rule-free manner. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score allows identifying and ranking the most promising molecular designs based on the probabilities learned by the CLM. Using perplexity to compare “greedy” (beam search) with “explorative” (multinomial sampling) methods for SMILES generation, certain advantages of multinomial sampling become apparent. Additionally, perplexity scoring is performed to identify undesired model biases introduced during model training and allows the development of a new ranking system to remove those undesired biases.

Requirements

First, you need to clone the repository:

git clone git@github.com:ETHmodlab/CLM_perplexity.git

Then, run the following commands, which create a conda virtual environment and install all the required packages. If you do not have conda, install it first by following the official conda installation instructions.

cd CLM_perplexity/
conda env create -f environment.yml

Once the installation is done, you can activate the virtual conda environment for this project:

conda activate plex

Please note that you will need to activate this virtual conda environment every time you want to use this project.

How to run an experiment

You can run an example experiment based on the data used in the paper by following the procedures described in A and B.
The output of each step can be found in CLM_perplexity/experiments/outputs/FT/{configuration_file_name}/.

A. Compare the perplexity of SMILES strings generated with multinomial sampling and with beam search.

A1. Process the data to fine-tune the chemical language model (CLM). Note: the commands below take the path to the configuration file as their argument (in this example, configfiles/FT/A01.ini, where A01.ini is the name of the configuration file and FT stands for fine-tuning).

cd experiments/
sh run_data_processing.sh configfiles/FT/A01.ini

Note: the pretrained weights of the CLM are provided.

A2. Fine-tune the CLM:

sh run_training.sh configfiles/FT/A01.ini

A3. Generate SMILES strings with multinomial sampling. Note: the following commands take as arguments the path to the configuration file and the range of epochs at which you want to carry out the experiment (start, step, end). In this example, the experiment is run on SMILES strings sampled at epochs 2 and 4.

sh run_generation_multinomial.sh configfiles/FT/A01.ini 2 2 4
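The three trailing numbers can be read as an inclusive range. The sketch below is only an illustration of that reading, based on the example above (2 2 4 selects epochs 2 and 4). The helper name epochs_in_range is hypothetical and not part of this repository, and how the scripts treat an end value that the step does not reach is an assumption.

```python
from typing import List


def epochs_in_range(start: int, step: int, end: int) -> List[int]:
    """Return the epochs selected by (start, step, end), with `end` inclusive.

    Hypothetical helper for illustration; the repository's scripts may
    expand the range differently.
    """
    if step <= 0:
        raise ValueError("step must be a positive integer")
    # range() excludes its stop value, so add 1 to make `end` inclusive.
    return list(range(start, end + 1, step))


# The arguments used in this README: start=2, step=2, end=4.
print(epochs_in_range(2, 2, 4))
```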

A4. Process the generated SMILES strings:

sh run_process_multinomial_generated.sh configfiles/FT/A01.ini 2 2 4

A5. Extract the probabilities from the CLM:

sh run_proba_extraction_multinomial.sh configfiles/FT/A01.ini 2 2 4

A6. Compute the perplexity:

sh run_get_perplexity_multinomial.sh configfiles/FT/A01.ini 2 2 4

You can find a .csv file with the results in the outputs/ directory, under perplexity/.
Note: only de novo molecules (with respect to the pretraining and fine-tuning data) are considered. You can change the argument in the bash file (.sh) if you also want to include molecules that are not de novo.
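For orientation, the following is a minimal sketch of the standard perplexity definition applied to one SMILES string: with N tokens and per-token probabilities p_i taken from the CLM, perplexity is 2 raised to the negative mean of log2(p_i). It is not the repository's own code, which may differ, for example in how start and end tokens are handled. The function name and the example probabilities are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Standard perplexity of a sequence from its per-token probabilities.

    Computes 2 ** (-(1/N) * sum(log2(p_i))). Lower values mean the model
    assigns a higher average probability to the tokens of the string.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in the interval (0, 1]")
    mean_log2 = sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** (-mean_log2)


# Hypothetical per-token probabilities for two generated SMILES strings.
designs = {
    "CCO": [0.9, 0.8, 0.9],
    "CCN": [0.5, 0.4, 0.6],
}

# Rank the designs from lowest to highest perplexity.
ranked = sorted(designs, key=lambda smi: perplexity(designs[smi]))
print(ranked)
```

Sorting by ascending perplexity places the strings the model is most confident about first, which is the ordering a perplexity-based ranking relies on.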

A7. Generate SMILES strings with beam search:

sh run_generation_beam.sh configfiles/FT/A01.ini 2 2 4

A8. Process the SMILES strings generated with beam search:

sh run_process_beam_generated.sh configfiles/FT/A01.ini 2 2 4

A9. Extract the probabilities from the CLM:

sh run_proba_extraction_beam.sh configfiles/FT/A01.ini 2 2 4

A10. Compute the perplexity:

sh run_get_perplexity_beam.sh configfiles/FT/A01.ini 2 2 4
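Steps A3 to A6 and A7 to A10 call scripts that follow the same naming pattern, so the sequence can be scripted. The sketch below is a hypothetical convenience wrapper, not part of the repository. It only builds the command lines shown above and, by default, prints them without running anything. It assumes that it is launched from the experiments/ directory after A1 and A2 have completed.

```python
import subprocess
from typing import List


def build_commands(config: str, start: int, step: int, end: int,
                   mode: str) -> List[List[str]]:
    """Build the four command lines for one sampling mode, in step order."""
    if mode not in ("multinomial", "beam"):
        raise ValueError("mode must be 'multinomial' or 'beam'")
    # Script names are taken from the steps listed in this README.
    scripts = [
        f"run_generation_{mode}.sh",
        f"run_process_{mode}_generated.sh",
        f"run_proba_extraction_{mode}.sh",
        f"run_get_perplexity_{mode}.sh",
    ]
    epoch_args = [str(start), str(step), str(end)]
    return [["sh", script, config, *epoch_args] for script in scripts]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


cmds = build_commands("configfiles/FT/A01.ini", 2, 2, 4, "beam")
run_all(cmds)  # dry run: prints the commands only
```

Passing dry_run=False would execute the steps one after another. Stopping at the first failure keeps a later step from running on missing output.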

B. Compute the delta rank of the SMILES strings generated with multinomial sampling.

B1. Extract the probabilities with the pretrained CLM:

sh run_proba_extraction_multinomial_from_pretrained.sh configfiles/FT/A01.ini 2 2 4

B2. Compute the perplexity:

sh run_get_perplexity_multinomial_from_pretrained.sh configfiles/FT/A01.ini 2 2 4

B3. Finally, compute the delta:

sh run_get_delta.sh configfiles/FT/A01.ini 2 2 4
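As a rough illustration of what a delta rank expresses, the sketch below ranks the same molecules by perplexity under the fine-tuned and the pretrained CLM, then subtracts the two ranks. The sign convention (fine-tuned rank minus pretrained rank), the function names, and the example values are assumptions made for this illustration. The repository's scripts define the actual computation.

```python
from typing import Dict


def rank_by_perplexity(ppl: Dict[str, float]) -> Dict[str, int]:
    """Assign rank 1 to the lowest perplexity, rank 2 to the next, and so on.

    Ties are broken by insertion order, because sorted() is stable.
    """
    ordered = sorted(ppl, key=ppl.get)
    return {smi: rank for rank, smi in enumerate(ordered, start=1)}


def delta_rank(ppl_finetuned: Dict[str, float],
               ppl_pretrained: Dict[str, float]) -> Dict[str, int]:
    """Rank under the fine-tuned model minus rank under the pretrained model.

    With this (assumed) convention, a negative delta means the molecule
    ranks better under the fine-tuned model than under the pretrained one.
    """
    if ppl_finetuned.keys() != ppl_pretrained.keys():
        raise ValueError("both inputs must score the same molecules")
    rank_ft = rank_by_perplexity(ppl_finetuned)
    rank_pt = rank_by_perplexity(ppl_pretrained)
    return {smi: rank_ft[smi] - rank_pt[smi] for smi in ppl_finetuned}


# Hypothetical perplexities for three SMILES strings.
ppl_ft = {"CCO": 1.2, "c1ccccc1": 1.5, "CC(=O)O": 2.0}
ppl_pt = {"CCO": 1.1, "c1ccccc1": 2.5, "CC(=O)O": 1.8}

delta = delta_rank(ppl_ft, ppl_pt)
print(delta)
```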

How to cite this work

tbd

License

MIT License

Address

MODLAB
ETH Zurich
Inst. of Pharm. Sciences
HCI H 413
Vladimir-Prelog-Weg 4
CH-8093 Zurich