
Molecule Transformers is a collection of recipes for pre-training and fine-tuning molecular transformer language models, including BART, BERT, etc. Full thesis available at https://moleculetransformers.github.io/thesis_cs_msc_Khan_Shahrukh.pdf.


Enumeration-aware Molecular Transformers for Representation Learning

Overview

We introduce a suite of neural language model tools for pre-training and fine-tuning SMILES-based molecular language models. Furthermore, we provide recipes for fine-tuning these models in low-data settings using semi-supervised learning.

1. Enumeration-aware Molecular Transformers

Introduces contrastive learning alongside multi-task regression and masked language modelling as pre-training objectives to inject enumeration knowledge into pre-trained molecular language models.
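For context, a single molecule admits many valid SMILES renderings; enumeration generates these alternative strings, while canonicalization maps them back to one reference form. Below is a minimal sketch using RDKit (an assumed dependency here; the repository may ship its own enumeration utilities):

```python
# Minimal sketch of SMILES enumeration with RDKit (assumed dependency;
# the repository may provide its own enumeration utilities).
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used purely as an example
mol = Chem.MolFromSmiles(smiles)

canonical = Chem.MolToSmiles(mol, canonical=True)
# doRandom=True yields a randomly ordered but chemically equivalent SMILES string.
enumerations = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(10)}

print("canonical:", canonical)
for s in enumerations:
    print("enumeration:", s)
```

Pairs of enumerations of the same molecule can then serve as positive pairs for the contrastive objective described below.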

a. Molecular Domain Adaptation (Contrastive Encoder-based)

i. Architecture

Figure: Smole-BERT architecture diagram.

ii. Contrastive Learning

Figure: contrastive learning objective.
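As an illustration of how such a contrastive objective can be computed, the sketch below evaluates an InfoNCE-style loss over embeddings of two enumerations per molecule, with the other molecules in the batch acting as negatives; the random "embeddings" and batch construction are placeholders, not the repository's exact implementation.

```python
# Sketch of an InfoNCE-style contrastive loss over paired SMILES enumerations.
# z1[i] and z2[i] are assumed to be embeddings of two enumerations of the same
# molecule; all other molecules in the batch act as in-batch negatives.
import torch
import torch.nn.functional as F


def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine similarity between every anchor and every candidate in the batch.
    logits = z1 @ z2.T / temperature
    # The matching enumeration sits on the diagonal.
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


# Toy usage with random tensors standing in for encoder outputs.
z1, z2 = torch.randn(8, 256), torch.randn(8, 256)
print(info_nce_loss(z1, z2).item())
```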

b. Canonicalization Encoder-decoder (Denoising Encoder-decoder)

Figure: canonicalization encoder-decoder (denoising) objective.
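To make the canonicalization objective concrete, the following hedged sketch supervises a BART-style encoder-decoder to map an enumerated SMILES string back to its canonical form; the facebook/bart-base checkpoint and its English tokenizer are stand-ins, since the actual models use a SMILES-specific vocabulary.

```python
# Sketch of the canonicalization (denoising) objective with a BART-style model.
# facebook/bart-base and its tokenizer are placeholders; a SMILES-specific
# tokenizer and vocabulary would be used in practice.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

enumerated = "O=C(C)Oc1ccccc1C(=O)O"   # a non-canonical rendering (input)
canonical = "CC(=O)Oc1ccccc1C(=O)O"    # canonical SMILES (target)

inputs = tokenizer(enumerated, return_tensors="pt")
labels = tokenizer(canonical, return_tensors="pt").input_ids

# The model is trained to map any enumeration back to the canonical form.
outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=labels,
)
print("denoising loss:", outputs.loss.item())
```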

Code

You can reproduce the experiments as follows:

Install dependencies

```bash
pip install -r requirements.txt
```

1. Pre-training the molecular transformers

The detailed steps for pre-training encoder-based architectures with MLM and MTR objectives, as well as Seq2Seq BART with denoising objectives, are outlined here.
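As a rough illustration of the MLM objective only (not the repository's training script), the following masks SMILES tokens with Hugging Face's language-modelling collator and computes the masked-token loss; the checkpoint and tokenizer are placeholders:

```python
# Rough sketch of masked language modelling on SMILES strings with Hugging Face
# Transformers; the model name, tokenizer and hyperparameters are illustrative only.
from transformers import BertForMaskedLM, BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # placeholder vocabulary
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

smiles_batch = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CCO"]
encodings = tokenizer(smiles_batch, padding=True, return_tensors="pt")

# Randomly mask 15% of tokens and build the MLM labels (-100 for unmasked positions).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
features = [{"input_ids": ids} for ids in encodings["input_ids"]]
batch = collator(features)

loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
print("MLM loss:", loss.item())
```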

2. Domain Adaptation with Contrastive Learning and Multitask Learning

To reproduce the domain adaptation step from our work, please follow the guidelines here.
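For orientation, a contrastive domain-adaptation step of this kind can be sketched with the sentence-transformers library as below; the checkpoint name and example pairs are placeholders, and the multitask regression part is omitted:

```python
# Sketch of contrastive domain adaptation over enumerated SMILES pairs with
# sentence-transformers; the checkpoint name and data are placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Stand-in for a pre-trained molecular encoder checkpoint.
model = SentenceTransformer("bert-base-uncased")

# Each pair holds two enumerations of the same molecule (a positive pair);
# other molecules in the batch serve as in-batch negatives.
pairs = [
    InputExample(texts=["CC(=O)Oc1ccccc1C(=O)O", "O=C(C)Oc1ccccc1C(=O)O"]),
    InputExample(texts=["c1ccccc1", "C1=CC=CC=C1"]),
]
loader = DataLoader(pairs, batch_size=2, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```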

3. Fine-tuning

Finally, fine-tuning the domain-adapted molecular language models on downstream tasks is explained in the accompanying notebook, which can be found here.
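As a minimal sketch of such a downstream setup (a linear probe on frozen embeddings; the accompanying notebook may instead fine-tune end-to-end and use benchmark datasets):

```python
# Minimal downstream sketch: embed SMILES with a domain-adapted encoder and
# train a linear probe. The checkpoint path, data and labels are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("path/to/domain-adapted-encoder")  # hypothetical local path

train_smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CCN"]
train_labels = [0, 1, 0, 1]  # toy binary property labels

X = encoder.encode(train_smiles)          # frozen molecular embeddings
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

print(clf.predict(encoder.encode(["CCOC"])))
```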

Acknowledgements

Code base adapted from: