Bootstrap Your Own PLM: Boosting Semantic Features of PLMs for Unsuperivsed Contrastive Learning

This repo contains implementation of Bootstrap Your Own PLM: Boosting Semantic Features of PLMs for Unsuperivsed Contrastive Learning. Our code is mainly based on the code of SimCSE. Please refer to their repository for more detailed information.

Overview

This paper aims to investigate the possibility of exploiting original semantic features of PLMs (pre-trained language models) during contrastive learning in the context of SRL (sentence representation learning). We propose BYOP, which boosts well-represented features, taking the opposite idea of IFM, under the assumption that SimCSE’s dropout-noise-based augmentation may be too simple to modify high-level semantic features, and that the features learned by PLMs are semantically meaningful and should be boosted, rather than removed.

Setups

Requirements

First, install PyTorch by following the instructions from the official website.

pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

If you instead use CUDA <11 or CPU, install PyTorch by the following command,

pip install torch==1.12.1

Then run the following script to install the remaining dependencies,

pip install -r requirements.txt

Download the pretraining dataset

cd data
bash download_wiki.sh

Download the downstream dataset

cd Eval/data/
bash download.sh

Training

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir <output_model_dir> \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 250 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_ifm \
    --ifm_mode single \
    --ifm_type n- \
    --ifm_logit_type constant \
    --margin 0.01 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

Evaluation

You can run the commands below for evaluation after using the repo to train a model:

python evaluation.py \
    --model_name_or_path <output_model_dir> \
    --pooler cls_before_pooler \
    --task_set <sts|transfer|full> \
    --mode test

For more detailed information, please check SimCSE's GitHub repo.

Citation

@inproceedings{jeong2024bootstrap,
  title={Bootstrap Your Own PLM: Boosting Semantic Features of PLMs for Unsuperivsed Contrastive Learning},
  author={Jeong, Yoo Hyun and Han, Myeong Soo and Chae, Dong-Kyu},
  booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
  pages={560--569},
  year={2024}
}

myngsooo/BYOP