
Sentence Level Pretraining for Natural Language Inference


In this paper we discuss a novel approach to improving the performance of transformer-based models on the natural language inference (NLI) task. We hypothesize that the knowledge of the relationships between the premise and the hypotheses needed to succeed at NLI can be extracted from an unannotated corpus in a self-supervised manner. We propose two training objectives to achieve this: Sentence Level Language Modelling (SL-LM) and Sentence Level Masked Language Modelling (SL-MLM). To show the conceptual validity of this hypothesis, we compare the performance of transformer-based models with this pretraining against non-pretrained models on a chosen NLI task.

See paper here. Contributors:

  • Sungjun Han
  • Anastasiia Sirotina

This is the code release for the paper. We do not release the pretrained models because of the size restrictions on GitHub.

Repository Structure

  • scripts : all runnable python scripts (i.e. train, test, build vocabulary ...) are in this folder
    • scripts/run_scripts : all .sh scripts for running gpt2/bert training
    • scripts/baseline : all train/test python scripts for BoW and DL baseline models
      • scripts/baseline/baseline_train_test.py : train/test python script for the BoW baseline models
      • scripts/baseline/dl_baseline_train_test.py : train/test python script for the DL baseline models
    • scripts/bert : all train/test python scripts for pretraining/fine-tuning BERT
      • scripts/baseline/advanced_train_test.py : train/test python script for the BERT based models
      • scripts/baseline/pretrain_mlm.py : train/test python script for further pretraining of BERT
    • scripts/gpt2 : all train/test python scripts for pretraining/fine-tuning GPT-2
      • scripts/baseline/gpt2test.py : test python script for GPT-2
      • scripts/baseline/gpt2train.py : train (fine-tuning) python script for GPT-2
      • scripts/baseline/pretrain-lm-aux.py : pretraining (fine-tuning) python script for GPT-2 with auxiliary pretraining objective
      • scripts/baseline/preetrain-lm.py : pretraining (fine-tuning) python script for GPT-2 without auxiliary pretraining objective
    • scripts/annotate.py : creating manual human annotations for ART
    • scripts/build_vocab.py : building vocabulary for the baselines
  • nli : auxiliary functions used by the files in scripts
    • nli/models : all models are defined in this directory (baseline/GPT2/BERT)
      • nli/models-lm/BoW.py : all python object classes for BoW baseline model
      • nli/models-lm/GPT2.py : all PyTorch nn.Module classes for GPT2 models
      • nli/models-lm/StaticEmb.py : all PyTorch nn.Module classes for DL baseline models
      • nli/models-lm/Transformers.py : all PyTorch nn.Module classes for transformer baseline models
    • nli/pretrain-mlm : all functions used by BERT training/testing for pretraining
      • nli/pretrain-mlm/dataloader.py : holds PyTorch dataset and dataloader objects for pretraining BERT
    • nli/pretrain-lm : all functions used by GPT-2 training/testing for pretraining and fine-tuning
      • nli/pretrain-lm/ft_dataloader.py : holds PyTorch dataset and dataloader objects for fine-tuning GPT-2
      • nli/pretrain-lm/pt_dataloader.py : holds PyTorch dataset and dataloader objects for pretraining GPT-2
    • nli/dataloader.py : holds PyTorch dataset and dataloader objects for preparing ART for baselines and BERT
    • nli/embedding.py : loads GloVe embeddings to be used by DL baselines
    • nli/metrics.py : holds an object class that is used to keep evaluation results during training/testing
    • nli/preprocess.py : functions used for preprocessing the dataset
    • nli/similarity.py : distance and similarity measures for BoW baseline
    • nli/tokenization.py : holds a tokenizer object class used by DL baselines
    • nli/utils.py : various utility functions


conda env create -f environment.yml
conda activate nlplab
pip install -e .

Make sure a data folder containing the aNLI data is present: data/alphanli/...


See how well you do at this task yourself! The annotation script lets you label --max_samples randomly selected aNLI data points.

python scripts/annotate.py --max_samples 30 --annot_pred <PROVIDE OUTPUT PATH HERE>
python scripts/evaluate.py --label_dir <PROVIDE OUTPUT PATH FOR ANNOTATION HERE> --pred_dir  <PROVIDE GROUND TRUTH LABEL FILE PATH>
cat eval_result.json

Build a vocabulary

For the deep learning baseline models, we need to build a vocabulary to initialize our WhiteSpaceTokenizer. For the pretrained models, we simply use the available pretrained sub-word tokenizers.

python scripts/build_vocab.py --out_dir <VOCABULARY OUTPUT DIRECTORY>

You can choose between two types with --vocab_type:

  1. regular: 'reg'
  2. Byte-Pair-Encoding: 'bpe' #not implemented yet

You can also set the vocabulary specification parameters:

  1. --min_occurence : minimum number of occurrences for a word type to be included in the vocabulary
  2. --vocabulary_size : desired vocabulary size; keeps the most frequent word types and filters out the rest
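
To make the effect of these two parameters concrete, here is a minimal sketch of frequency-based vocabulary building for a whitespace tokenizer. It is not the code in scripts/build_vocab.py; the function names build_vocab and encode, the special tokens, and the lowercasing step are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(sentences, min_occurrence=1, vocabulary_size=None,
                specials=("<pad>", "<unk>")):
    """Map word types to integer ids (illustrative, not the repository's code)."""
    # Count lowercased, whitespace-separated tokens over the whole corpus.
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # Drop word types seen fewer than min_occurrence times.
    kept = [(w, c) for w, c in counts.items() if c >= min_occurrence]
    # Most frequent first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    # Optionally keep only the top vocabulary_size word types.
    if vocabulary_size is not None:
        kept = kept[:vocabulary_size]
    # Special tokens (assumed here) take the first ids.
    vocab = {tok: i for i, tok in enumerate(specials)}
    for word, _ in kept:
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, unk="<unk>"):
    """Turn a sentence into ids, mapping unseen tokens to the unknown id."""
    return [vocab.get(tok, vocab[unk]) for tok in sentence.lower().split()]


sentences = ["the cat sat", "the dog sat", "a bird flew"]
vocab = build_vocab(sentences, min_occurrence=2)
ids = encode("the bird sat", vocab)
```

With min_occurrence=2 only "the" and "sat" survive the filter, so "bird" is mapped to the unknown id when encoding.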

Train a baseline model

Train baseline model : Bag of Words

Note that the baseline BoW model with the maximum entropy classifier does not need to be trained for more than one epoch.

Perceptron using Levenshtein

To run a model that scored 50.97%:

python scripts/baseline/baseline_train_test.py \
--model_type BoW \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--bow_classifier prc \
--num_epochs 1 \
--bow_sim_function levenshtein \
--bow_weight_function idf \
--bow_max_cost 100 \
--bow_lemmatize True \
--bow_bidirectional False \
--bow_me_num_buckets 30 \
--bow_me_step_size 0.1
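
For intuition about the levenshtein option of --bow_sim_function, the following is a rough sketch of an edit-distance-based token similarity. It is not taken from nli/similarity.py: the unit edit costs and the normalization into a score in [0, 1] are assumptions, and the repository may combine the distance with --bow_max_cost and the IDF weights differently.

```python
def levenshtein(a, b):
    """Edit distance between two strings with unit insert/delete/substitute cost."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
score = similarity("walk", "walked")
```

A character-level measure like this rewards surface overlap between inflected forms such as "walk" and "walked", whereas the distributional option compares words through their embeddings instead.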

Perceptron using Distributional

To run a model that scored 50.79%:

python scripts/baseline/baseline_train_test.py \
--model_type BoW \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--bow_classifier prc \
--num_epochs 1 \
--bow_sim_function distributional \
--bow_weight_function idf \
--bow_max_cost 100 \
--bow_lemmatize True \
--bow_bidirectional False \
--bow_me_num_buckets 30 \
--bow_me_step_size 0.1

Maximum Entropy using IDF / Levenshtein / Lemmatization

To run a model that scored 50.13%:

python scripts/baseline/baseline_train_test.py \
--model_type BoW \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--bow_classifier maxent \
--num_epochs 1 \
--bow_sim_function levenshtein \
--bow_weight_function idf \
--bow_max_cost 100 \
--bow_lemmatize True \
--bow_bidirectional False \
--bow_me_num_buckets 100 \
--bow_me_step_size 0.1

Maximum Entropy using Distributional

To run a model that scored 51.52%:

python scripts/baseline/baseline_train_test.py \
--model_type BoW \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--bow_classifier maxent \
--num_epochs 1 \
--bow_sim_function distributional \
--bow_weight_function idf \
--bow_max_cost 100 \
--bow_lemmatize True \
--bow_bidirectional False \
--bow_me_num_buckets 30 \
--bow_me_step_size 0.1

Train baseline DL models : FFN/RNN/CNN encoder with FFN decoder

These models use pre-trained embeddings

FFN - train for 50 epochs using SUM pooling method

This can also be run with early stopping by setting --early_stopping to a value greater than 0.

To replicate the baseline accuracy score of 52.73%, run

python scripts/baseline/dl_baseline_train_test.py \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--model_type StaticEmb-mixture \
--sem_pooling sum \
--use_cuda True \
--batch_size 128 \
--learning_rate 5e-4 \
--optimizer adam \
--se_num_encoder_layers 3 \
--se_num_decoder_layers 3 \
--glove_model glove-wiki-gigaword-50 \
--evaluate true \
--early_stopping 0 \
--num_epochs 50 \
--seed 1234
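
As an illustration of what --sem_pooling sum means for a static-embedding encoder, the sketch below sums per-token vectors into a single sentence vector. It is a simplified stand-in rather than the model in nli/models: the function name sentence_vector, the toy 3-dimensional vectors (the command above uses 50-dimensional GloVe vectors), and the zero vector for unknown tokens are all assumptions.

```python
def sentence_vector(sentence, embeddings, dim):
    """Sum-pool token embeddings into one fixed-size sentence vector."""
    zero = [0.0] * dim
    # Look up each whitespace token; unknown tokens contribute a zero vector (assumed).
    vectors = [embeddings.get(tok, zero) for tok in sentence.lower().split()]
    if not vectors:
        return list(zero)
    # zip(*vectors) groups the values of each dimension so they can be summed.
    return [sum(column) for column in zip(*vectors)]


# Hypothetical toy embeddings, for illustration only.
toy_embeddings = {
    "dog": [1.0, 0.0, 2.0],
    "barks": [0.5, 1.0, -1.0],
}

pooled = sentence_vector("Dog barks loudly", toy_embeddings, dim=3)
```

Sum pooling ignores word order, so the resulting vector has the same size regardless of sentence length and can be passed directly to a feed-forward network.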

RNN - train for 50 epochs

This can also be run with early stopping by setting --early_stopping to a value greater than 0.

To replicate the baseline accuracy score of 55.10%, run

python scripts/baseline/dl_baseline_train_test.py \
--model_type StaticEmb-rnn \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--use_cuda True \
--batch_size 128 \
--learning_rate 5e-4 \
--optimizer adam \
--se_num_encoder_layers 2 \
--se_num_decoder_layers 3 \
--glove_model glove-wiki-gigaword-50 \
--sernn_bidirectional true \
--evaluate true \
--early_stopping 0 \
--num_epochs 50 \
--seed 1234
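
The commands in this section pass --early_stopping 0, which leaves early stopping off. The sketch below shows one common way such a setting is interpreted, as a patience counter on the dev accuracy. This is an assumption about the semantics, not the logic used by the training scripts; the EarlyStopping class and the accuracy values are made up for illustration.

```python
class EarlyStopping:
    """Stop once the dev score has not improved for `patience` epochs in a row."""

    def __init__(self, patience):
        self.patience = patience  # 0 disables early stopping (assumed semantics)
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record a dev score and return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.patience > 0 and self.bad_epochs >= self.patience


# Made-up dev accuracies, one per epoch.
dev_accuracy = [0.50, 0.53, 0.55, 0.54, 0.55, 0.52, 0.56]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, acc in enumerate(dev_accuracy, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
```

In this made-up run, training stops before the final epoch, which would have produced a better score. That is the usual trade-off between a small patience value and training time.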

CNN - train for 100 epochs

This can also be run with early stopping by setting --early_stopping to a value greater than 0.

To replicate the baseline accuracy score of 56.13%, run

python scripts/baseline/dl_baseline_train_test.py \
--model_type StaticEmb-cnn \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--use_cuda True \
--batch_size 128 \
--learning_rate 1e-4 \
--optimizer adam \
--glove_model glove-wiki-gigaword-50 \
--evaluate true \
--early_stopping 0 \
--num_epochs 100 \
--seed 1234

BERT with CLS token

This can also be run with early stopping by setting --early_stopping to a value greater than 0. We use the Hugging Face implementation of the pretrained transformer models.

To replicate the baseline accuracy score of 61.82%, run

python scripts/baseline/advanced_train_test.py \
--model_type pretrained-transformers-cls \
--pretrained_name bert-base-uncased \
--train_tsv data/alphanli/tsv/train.tsv \
--test_tsv data/alphanli/tsv/dev.tsv \
--batch_size 128 \
--early_stopping 0 \
--num_epochs 15 \
--evaluate True \
--learning_rate 1e-5 \
--use_cuda True \
--scheduler True \
--weight_decay 0.0 \
--seed 1234


All pretraining configurations are kept in .sh files under scripts/run_scripts/.

GPT-2 Sentence Level Language Modelling (SL-LM)

For multi-gpu:


For single-gpu:


BERT Sentence Level Masked Language Modelling (SL-MLM)

GPT-2 Fine-tuning

Training with SL-LM pretrained model

scripts/run_scripts/gpt2-dual-single-gpu.sh --from_pretrained <INSERT_PRETRAINED_PATH>

Training without SL-LM pretrained model

scripts/run_scripts/gpt2-dual-single-gpu.sh --from_pretrained None

BERT Fine-tuning

Training with SL-MLM pretrained model


Training without SL-MLM pretrained model
