
SCT

Implementation of An Efficient Self-Supervised Cross-View Training For Sentence Embedding (TACL 2023).

Citation

```bibtex
@article{10.1162/tacl_a_00620,
    author = {Limkonchotiwat, Peerat and Ponwitayarat, Wuttikorn and Lowphansirikul, Lalita and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana},
    title = "{An Efficient Self-Supervised Cross-View Training For Sentence Embedding}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {1572--1587},
    year = {2023},
    month = {12},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00620},
    url = {https://doi.org/10.1162/tacl\_a\_00620},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00620/2196817/tacl\_a\_00620.pdf},
}
```

Installation

```bash
git clone https://github.com/mrpeerat/SCT
cd SCT
pip install -e .
```

Our models (Hugging Face)

Self-supervised

Distillation

Usage
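
Below is a minimal sketch of loading a trained checkpoint and encoding sentences with plain `transformers`. The model path is a placeholder, and mean pooling is our assumption about the pooling strategy rather than a confirmed detail of the released models.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "your-model-path" is a placeholder for one of the released checkpoints.
tokenizer = AutoTokenizer.from_pretrained("your-model-path")
model = AutoModel.from_pretrained("your-model-path")
model.eval()

sentences = ["A man is playing a guitar.", "Someone plays an instrument."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Mean-pool the token embeddings over non-padding positions (assumed pooling).
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

# Cosine similarity between the two sentence embeddings.
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())
```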

Training data

We use the training data from the BSL paper.

Development data

We use the STS-B development set from Sentence-Transformers.
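
If you want to inspect the same split yourself, one option (an assumption about sourcing, not necessarily the exact file the training scripts read) is the GLUE STS-B validation set via Hugging Face `datasets`:

```python
from datasets import load_dataset

# STS-B validation split from GLUE; gold scores are on a 0-5 similarity scale.
stsb_dev = load_dataset("glue", "stsb", split="validation")
print(stsb_dev[0])  # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}
```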

Parameters

Self-supervised:

| Models | Reference Temp | Student Temp | Queue Size | Learning Rate |
|---|---|---|---|---|
| BERT-Tiny | 0.03 | 0.04 | 131072 | 5e-4 |
| BERT-Mini | 0.01 | 0.03 | 131072 | 3e-4 |
| BERT-Small | 0.02 | 0.03 | 65536 | 3e-4 |
| BERT-Base | 0.04 | 0.05 | 65536 | 5e-4 |
| BERT-Large | 0.04 | 0.05 | 16384 | 5e-4 |

Distillation:

| Models | Reference Temp | Student Temp | Queue Size | Learning Rate |
|---|---|---|---|---|
| BERT-Tiny | 0.03 | 0.04 | 131072 | 5e-4 |
| BERT-Mini | 0.04 | 0.05 | 65536 | 1e-4 |
| BERT-Small | 0.04 | 0.05 | 131072 | 1e-4 |
| BERT-Base | 0.04 | 0.05 | 65536 | 1e-4 |
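
To make the roles of these hyperparameters concrete, here is a hedged sketch (not the repository's actual training code) of how a reference temperature, a student temperature, and a negative queue typically combine in a cross-view objective: both views are scored against the queue, each similarity distribution is sharpened by its own temperature, and the student distribution is trained to match the reference one.

```python
import torch
import torch.nn.functional as F

def cross_view_loss(student_emb, reference_emb, queue,
                    student_temp=0.05, reference_temp=0.04):
    """Illustrative cross-view objective (assumed form, not the paper's exact loss).

    student_emb:   (batch, dim) embeddings from the student view
    reference_emb: (batch, dim) embeddings from the reference view
    queue:         (queue_size, dim) embeddings of previously seen instances
    """
    student_emb = F.normalize(student_emb, dim=-1)
    reference_emb = F.normalize(reference_emb, dim=-1)
    queue = F.normalize(queue, dim=-1)

    # Similarity of each view to every instance in the queue,
    # scaled by the corresponding temperature.
    student_logits = student_emb @ queue.t() / student_temp
    reference_logits = reference_emb @ queue.t() / reference_temp

    # Match the student's distribution over the queue to the reference's.
    reference_probs = F.softmax(reference_logits, dim=-1).detach()
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    reference_probs, reduction="batchmean")
```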

Train your own model

Please set the model's hyperparameters (see the tables above) before training.

```bash
>> bash Running_distillation_script.sh
>> bash Running_script.sh
```

For hyperparameter tuning, we search over the following grids:

```bash
learning_rate_all=(1e-4 3e-4 5e-4)
queue_sizes=(131072 65536 16384)
teacher_temps=(0.01 0.02 0.03 0.04 0.05 0.06 0.07)
student_temps=(0.01 0.02 0.03 0.04 0.05 0.06 0.07)
```
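
For a sense of the search space, here is a small sketch that enumerates the same grid in Python; how each combination is wired into the shell scripts is left to the scripts themselves.

```python
from itertools import product

# Mirrors the shell arrays above; 3 * 3 * 7 * 7 = 441 combinations in total.
learning_rates = [1e-4, 3e-4, 5e-4]
queue_sizes = [131072, 65536, 16384]
teacher_temps = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
student_temps = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]

for lr, queue, t_temp, s_temp in product(learning_rates, queue_sizes,
                                         teacher_temps, student_temps):
    # Each combination would be passed to the training script above.
    print(lr, queue, t_temp, s_temp)
```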

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval and SimCSE.

Before evaluation, please download the evaluation datasets by running:

```bash
cd SentEval
pip install -e .
cd data/downstream/
bash download_dataset.sh
```

Evaluation - Notebook

Please see the evaluation notebooks in this repository.

Evaluation - Python

```bash
python evaluation.py \
    --model_name_or_path "your-model-path" \
    --task_set sts \
    --mode test
```
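
If you prefer to drive SentEval directly, the sketch below follows SentEval's standard prepare/batcher interface; the mean-pooling batcher and the parameter values are assumptions, not necessarily the exact settings used by `evaluation.py`.

```python
import torch
import senteval
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-path")
model = AutoModel.from_pretrained("your-model-path")
model.eval()

def prepare(params, samples):
    return

def batcher(params, batch):
    # SentEval passes batches as lists of token lists.
    sentences = [" ".join(tokens) for tokens in batch]
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return ((outputs.last_hidden_state * mask).sum(1) / mask.sum(1)).numpy()

params = {"task_path": "SentEval/data", "usepytorch": True, "kfold": 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(["STS12", "STS13", "STS14", "STS15", "STS16",
                   "STSBenchmark", "SICKRelatedness"])
print(results)
```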

Main results - STS

Self-supervised:

| Models | STS (Avg.) |
|---|---|
| SCT-BERT-Tiny | 69.73 |
| SCT-BERT-Mini | 69.59 |
| SCT-BERT-Small | 72.56 |
| SCT-BERT-Base | 75.55 |
| SCT-BERT-Large | 78.16 |

Distillation:

| Models | STS (Avg.) |
|---|---|
| SCT-Distillation-BERT-Tiny | 76.43 |
| SCT-Distillation-BERT-Mini | 77.58 |
| SCT-Distillation-BERT-Small | 78.16 |
| SCT-Distillation-BERT-Base | 79.58 |

Downstream tasks - Reranking and NLI

- For the reranking evaluation code, we use USEB (a usage sketch follows this list)
- For the NLI evaluation code, we use SentEval
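
The call below follows the usage pattern in the USEB README; the encoder function shown (mean pooling over a placeholder checkpoint) is our assumption, and any `List[str] -> Tensor` embedding function can be substituted.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from useb import run

tokenizer = AutoTokenizer.from_pretrained("your-model-path")
model = AutoModel.from_pretrained("your-model-path")
model.eval()

def semb_fn(sentences):
    # USEB expects a function mapping a list of sentences to a 2D tensor.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

results, results_main_metric = run(
    semb_fn_askubuntu=semb_fn,
    semb_fn_cqadupstack=semb_fn,
    semb_fn_twitterpara=semb_fn,
    semb_fn_scidocs=semb_fn,
    eval_type="test",
    data_eval_path="data-eval",
)
```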

Self-supervised:

| Models | Reranking (Avg.) | NLI (Avg.) |
|---|---|---|
| SCT-BERT-Tiny | 55.29 | 71.89 |
| SCT-BERT-Small | 58.59 | 75.70 |
| SCT-BERT-Base | 60.97 | 77.93 |
| SCT-BERT-Large | 63.02 | 79.55 |

Distillation:

| Models | Reranking (Avg.) | NLI (Avg.) |
|---|---|---|
| SCT-Distillation-BERT-Tiny | 61.14 | 78.53 |
| SCT-Distillation-BERT-Small | 61.94 | 80.44 |
| SCT-Distillation-BERT-Base | 64.63 | 80.97 |