Training a pre-trained BERT language model on molecular SMILES from the MoleculeNet benchmark by leveraging mixup and enumeration augmentations.

MoleculeNet SMILES BERT Mixup

This repository contains an implementation of the mixup strategy for text classification, applied here to molecular SMILES strings. The implementation is primarily based on the paper "Augmenting Data with Mixup for Sentence Classification: An Empirical Study", although there are some differences.
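
For reference, the core mixup operation (Zhang et al., 2018) forms a convex combination of two inputs and of their one-hot labels, with the mixing ratio drawn from a Beta distribution. A minimal NumPy sketch of that operation (not the repository's exact code):

import numpy as np

def mixup_pair(x_a, x_b, y_a, y_b, alpha=1.0):
    """Interpolate two inputs and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b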

Three variants of mixup are considered for text classification (a sketch of all three hook points follows the list):

  1. Embedding mixup: texts are mixed immediately after the word embedding layer
  2. Hidden/Encoder mixup: mixup is applied just before the last fully connected layer
  3. Sentence mixup: mixup is applied to the logits, just before the softmax
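
The sketch below illustrates where each variant mixes a pair of examples in a BERT-style classifier. It is a minimal illustration, assuming a PyTorch encoder that maps embeddings of shape (B, T, H) to pooled states of shape (B, H); the class and argument names are illustrative, not the repository's exact code.

import torch.nn as nn

class MixupClassifier(nn.Module):
    """Sketch of the three mixup hook points; not the repository's exact model."""

    def __init__(self, encoder, hidden_dim, num_classes, method="embed"):
        super().__init__()
        self.encoder = encoder   # BERT-style encoder mapping (B, T, H) -> (B, H)
        self.fc = nn.Linear(hidden_dim, num_classes)
        self.method = method     # 'embed', 'encoder', or 'sent'

    def forward(self, emb_a, emb_b, lam):
        # emb_a, emb_b: word embeddings of two inputs, each of shape (B, T, H)
        if self.method == "embed":
            # 1. Embedding mixup: mix immediately after the word embeddings
            return self.fc(self.encoder(lam * emb_a + (1 - lam) * emb_b))
        h_a, h_b = self.encoder(emb_a), self.encoder(emb_b)
        if self.method == "encoder":
            # 2. Hidden/Encoder mixup: mix before the last fully connected layer
            return self.fc(lam * h_a + (1 - lam) * h_b)
        # 3. Sentence mixup: mix the logits, just before the softmax
        return lam * self.fc(h_a) + (1 - lam) * self.fc(h_b)

In all three variants, the one-hot labels of the two inputs are interpolated with the same mixing ratio, so the network is trained against soft targets.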

Run Supervised Training with Late Mixup Augmentation

The sweeps in this README are written as IPython/Jupyter cells: the ! shell escape and the {variable} interpolation require IPython.

from tqdm import tqdm

# Sweep over mixup variants, datasets, training-set sizes, and augmentation counts.
SAMPLES_PER_CLASS = [50, 100, 150, 200, 250]
N_AUGMENT = [0, 2, 4, 8, 16]
DATASETS = ['bace', 'bbbp']
METHODS = ['embed', 'encoder', 'sent']
OUTPUT_FILE = 'eval_result_mixup_augment_v1.csv'
N_TRIALS = 20
EPOCHS = 20

for method in METHODS:
    for dataset in DATASETS:
        for sample in SAMPLES_PER_CLASS:
            for n_augment in N_AUGMENT:
                for _ in tqdm(range(N_TRIALS)):
                    !python bert_mixup/late_mixup/train_bert.py --dataset-name={dataset} --epoch={EPOCHS} \
                        --batch-size=16 --model-name-or-path=shahrukhx01/muv2x-simcse-smole-bert \
                        --samples-per-class={sample} --eval-after={EPOCHS} --method={method} \
                        --out-file={OUTPUT_FILE} --n-augment={n_augment}
                    !cat {OUTPUT_FILE}
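
The --n-augment flag presumably controls the number of enumeration augmentations mentioned in the description: the same molecule can be written as many equivalent, non-canonical SMILES strings, and training on several of them acts as data augmentation. Below is a minimal sketch of such enumeration using RDKit; the helper is illustrative, and the repository's own augmentation code may differ.

from rdkit import Chem

def enumerate_smiles(smiles, n_augment):
    """Return up to n_augment distinct random (non-canonical) SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = set()
    for _ in range(10 * n_augment):  # bounded attempts; tiny molecules have few forms
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_augment:
            break
    return sorted(variants)

print(enumerate_smiles('CC(=O)Oc1ccccc1C(=O)O', 4))  # aspirin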

Run Supervised Training with Early Mixup Augmentation

from tqdm import tqdm

SAMPLES_PER_CLASS = [50, 100, 150, 200, 250]
N_AUGMENT = [2, 4, 8, 16, 32]
DATASETS = ['bace', 'bbbp']
OUTPUT_FILE = '/nethome/skhan/moleculenet-smiles-bert-mixup/eval_result_early_mixup.csv'
N_TRIALS = 20
EPOCHS = 100


# Note: unlike the late-mixup sweep above, there is no loop over mixup
# variants here, and the early-mixup script is run without a --method flag.
for dataset in DATASETS:
    for sample in SAMPLES_PER_CLASS:
        for n_augment in N_AUGMENT:
            for _ in tqdm(range(N_TRIALS)):
                !python bert_mixup/early_mixup/main.py --dataset-name={dataset} --epoch={EPOCHS} \
                --batch-size=16 --model-name-or-path=shahrukhx01/muv2x-simcse-smole-bert \
                --samples-per-class={sample} --eval-after={EPOCHS} \
                --out-file={OUTPUT_FILE} --n-augment={n_augment}
                !cat {OUTPUT_FILE}
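
Each trial appears to append its result to OUTPUT_FILE, which the loop prints with cat after every run. A finished sweep can then be summarized with pandas, as in the sketch below; the column names ('dataset', 'samples_per_class', 'n_augment', 'roc_auc') are hypothetical placeholders and must be matched to the actual CSV header.

import pandas as pd

# Column names are hypothetical -- check the real header of the results CSV.
df = pd.read_csv('eval_result_early_mixup.csv')
summary = (
    df.groupby(['dataset', 'samples_per_class', 'n_augment'])['roc_auc']
      .agg(['mean', 'std'])  # average and spread over the N_TRIALS runs
      .round(3)
)
print(summary)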

Acknowledgement

The code in this repository is mainly adapted from the repo "xashru/mixup-text".