Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer
torch==1.6.0
cudatoolkit==10.0.103
cudnn==7.6.5
sentence-transformers==0.3.9
transformers==3.4.0
tensorboardX==2.1
pandas==1.1.5
sentencepiece==0.1.85
matplotlib==3.4.1
apex==0.1.0
To install apex, run:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./
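Before training, it can help to confirm that the installed packages match the pins listed above. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_mismatches` are illustrative and not part of this repository. It assumes that `cudatoolkit`, `cudnn` and `apex` are installed through conda or from source, so they are left out of the check.

```python
from importlib import metadata

# Pins copied from the requirements list above. cudatoolkit, cudnn and apex
# are omitted on the assumption that they are not installed as pip packages.
PINNED = {
    "torch": "1.6.0",
    "sentence-transformers": "0.3.9",
    "transformers": "3.4.0",
    "tensorboardX": "2.1",
    "pandas": "1.1.5",
    "sentencepiece": "0.1.85",
    "matplotlib": "3.4.1",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a (pinned, installed) pair."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Wheels may carry a local suffix such as "+cu101"; ignore it.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: pinned {wanted}, installed {found or 'not installed'}")
```

An empty output means every checked package matches its pin.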
- Download a pre-trained language model (e.g. bert-base-uncased) to the folder `./bert-base-uncased` from HuggingFace's Library.
- Download the STS datasets to the `./data` folder by running `cd data && bash get_transfer_data.bash`. The script is modified from the SentEval toolkit.
- Run the scripts in the folder `./scripts` to reproduce our experiments. For example, run the following script to train unsupervised consert-base: `bash scripts/unsup-consert-base.sh`
- Download the pre-trained language model (e.g. chinese-roberta-wwm-ext) to the folder `./chinese-roberta-wwm-ext` from HuggingFace's Library.
- Download the Chinese STS datasets to the `./data` folder by running `cd data && bash get_chinese_sts_data.bash`
- Run the scripts in the folder `./scripts/chinese` to train models for Chinese STS tasks. For example, run the following script to train the base model for the `atec_ccks` task: `bash scripts/unsup-consert-base-atec_ccks.sh`

The `--chinese_dataset` option in `main.py` is used to select which Chinese STS dataset to use.
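To illustrate how an option like `--chinese_dataset` typically behaves, here is a small standalone `argparse` sketch. It is not the code in `main.py`: the list of choices is an assumption taken from the dataset names in the Chinese results table, and the default of `None` is likewise assumed.

```python
import argparse

# Assumed choices: the five dataset names used in the Chinese results table.
CHINESE_DATASETS = ["atec_ccks", "bq", "lcqmc", "pawsx", "stsb"]


def build_parser():
    """Build a parser with a single dataset-selection option (illustrative)."""
    parser = argparse.ArgumentParser(description="Hypothetical stand-in for main.py")
    parser.add_argument(
        "--chinese_dataset",
        choices=CHINESE_DATASETS,
        default=None,
        help="Chinese STS dataset to use (illustrative choices)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("Selected Chinese dataset:", args.chinese_dataset)
```

With `choices` set, `argparse` rejects an unknown dataset name at startup instead of failing later during data loading.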
ID | Model | STS12 | STS13 | STS14 | STS15 | STS16 | STSb | SICK-R | Avg. |
---|---|---|---|---|---|---|---|---|---|
- | bert-base-uncased (baseline) | 35.20 | 59.53 | 49.37 | 63.39 | 62.73 | 48.18 | 58.60 | 53.86 |
- | bert-large-uncased (baseline) | 33.06 | 57.64 | 47.95 | 55.83 | 62.42 | 49.66 | 53.87 | 51.49 |
1 | unsup-consert-base [Google Drive] [Baidu Cloud (q571)] | 64.64 | 78.49 | 69.07 | 79.72 | 75.95 | 73.97 | 67.31 | 72.74 |
2 | unsup-consert-large [Google Drive] [Baidu Cloud (9fm1)] | 70.28 | 83.23 | 73.80 | 82.73 | 77.14 | 77.74 | 70.19 | 76.45 |
3 | sup-sbert-base (re-impl.) [Google Drive] [Baidu Cloud (msqy)] | 69.93 | 76.00 | 72.15 | 78.59 | 73.53 | 76.10 | 73.01 | 74.19 |
4 | sup-sbert-large (re-impl.) [Google Drive] [Baidu Cloud (0oir)] | 73.06 | 77.77 | 75.21 | 81.63 | 77.30 | 79.74 | 74.75 | 77.07 |
5 | sup-consert-joint-base [Google Drive] [Baidu Cloud (jks5)] | 70.92 | 79.98 | 74.88 | 81.76 | 76.46 | 78.99 | 78.15 | 77.31 |
6 | sup-consert-joint-large [Google Drive] [Baidu Cloud (xua4)] | 73.15 | 81.45 | 77.04 | 83.32 | 77.28 | 81.15 | 78.34 | 78.82 |
7 | sup-consert-sup-unsup-base [Google Drive] [Baidu Cloud (5mc8)] | 73.02 | 84.86 | 77.32 | 82.70 | 78.20 | 81.34 | 75.00 | 78.92 |
8 | sup-consert-sup-unsup-large [Google Drive] [Baidu Cloud (tta1)] | 74.99 | 85.58 | 79.17 | 84.25 | 80.19 | 83.17 | 77.43 | 80.68 |
9 | sup-consert-joint-unsup-base [Google Drive] [Baidu Cloud (cf07)] | 74.46 | 84.19 | 77.08 | 83.77 | 78.55 | 81.37 | 77.01 | 79.49 |
10 | sup-consert-joint-unsup-large [Google Drive] [Baidu Cloud (v5x5)] | 76.93 | 85.20 | 78.69 | 85.44 | 79.34 | 82.93 | 76.71 | 80.75 |
Note:
- All the base models are trained from `bert-base-uncased` and the large models are trained from `bert-large-uncased`.
- For the unsupervised transfer, we merge all unlabeled texts from the 7 STS datasets (STS12-16, STSbenchmark and SICK-Relatedness) as the training data (89192 sentences in total), and use the STSbenchmark dev split (1500 human-annotated sentence pairs) to select the best checkpoint.
- The sentence representations are obtained by averaging the token embeddings at the last two layers of BERT.
- We re-trained models 2 to 10 on a single GeForce RTX 3090 with pytorch 1.8.1 and cuda 11.1 (rather than the V100, pytorch 1.6.0 and cuda 10.0 used in our initial experiments) and changed `max_seq_length` from 64 to 40 to reduce the required GPU memory (only for large models). Consequently, the results shown here may be slightly different from those reported in our paper.
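The Avg. column in the results table above appears to be the unweighted mean of the seven per-task scores, rounded to two decimals. The short check below recomputes it for two rows copied from the table; the function name `average_score` is illustrative.

```python
# Scores copied from the results table, in column order:
# STS12, STS13, STS14, STS15, STS16, STSb, SICK-R, followed by the reported Avg.
ROWS = {
    "bert-base-uncased (baseline)": (
        [35.20, 59.53, 49.37, 63.39, 62.73, 48.18, 58.60],
        53.86,
    ),
    "unsup-consert-base": (
        [64.64, 78.49, 69.07, 79.72, 75.95, 73.97, 67.31],
        72.74,
    ),
}


def average_score(scores):
    """Unweighted mean of the per-task scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


if __name__ == "__main__":
    for model, (scores, reported) in ROWS.items():
        computed = average_score(scores)
        print(f"{model}: computed {computed:.2f}, reported {reported:.2f}")
```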
ID | Model | atec_ccks | bq | lcqmc | pawsx | stsb |
---|---|---|---|---|---|---|
- | chinese-roberta-wwm-ext (baseline) | 11.28 | 40.21 | 59.89 | 09.35 | 60.86 |
- | chinese-roberta-wwm-ext-large (baseline) | 13.75 | 36.77 | 60.36 | 09.94 | 58.72 |
1 | unsup-consert-base-atec_ccks | **27.39** | 47.61 | 60.70 | 08.17 | 64.74 |
2 | unsup-consert-base-bq | 11.55 | **47.20** | 61.47 | 08.47 | 64.44 |
3 | unsup-consert-base-lcqmc | 04.57 | 38.15 | **67.34** | 08.80 | 67.78 |
4 | unsup-consert-base-pawsx | 08.66 | 38.35 | 61.97 | **09.36** | 63.89 |
5 | unsup-consert-base-stsb | 07.29 | 40.81 | 67.39 | 06.24 | **72.82** |
6 | unsup-consert-large-atec_ccks | **29.92** | 47.50 | 59.83 | 09.72 | 66.37 |
7 | unsup-consert-large-bq | 15.47 | **47.08** | 60.10 | 10.04 | 66.94 |
8 | unsup-consert-large-lcqmc | 11.40 | 37.79 | **66.45** | 10.81 | 67.68 |
9 | unsup-consert-large-pawsx | 11.25 | 36.25 | 65.15 | **11.38** | 67.45 |
10 | unsup-consert-large-stsb | 08.00 | 40.85 | 67.72 | 10.13 | **74.50** |
Note:
- All the base models are trained from `hfl/chinese-roberta-wwm-ext` and the large models are trained from `hfl/chinese-roberta-wwm-ext-large`.
- Each model is trained on a single Chinese STS dataset but evaluated on all 5 datasets. The bold numbers indicate that the model is evaluated on the same dataset it was trained on.
- The sentence representations are also obtained by averaging the token embeddings at the last two layers of BERT.
- For base models, we set `batch_size` to 96 and `max_seq_length` to 64, while for large models, we set `batch_size` to 32 and `max_seq_length` to 40 to reduce the required GPU memory.
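The notes for both result tables describe the sentence representation as an average of the token embeddings at the last two layers. The NumPy sketch below shows one plausible reading of that description, not the repository's implementation: the two layers are averaged token by token, then mean-pooled over the non-padding tokens. The function name, the use of an attention mask and the order of the two averaging steps are assumptions.

```python
import numpy as np


def mean_last_two_layers(last_layer, second_last_layer, attention_mask):
    """Average two layers' token embeddings, then mean-pool over real tokens.

    last_layer, second_last_layer: float arrays of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 for tokens, 0 for padding
    """
    # Token-wise average of the two layers.
    token_embeddings = (last_layer + second_last_layer) / 2.0
    # Zero out padding positions before summing over the sequence axis.
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for a row that is all padding.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    last = rng.normal(size=(2, 5, 8))
    prev = rng.normal(size=(2, 5, 8))
    mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
    # One fixed-size vector per sentence: shape (batch, hidden).
    print(mean_last_two_layers(last, prev, mask).shape)
```

With a HuggingFace BERT model, the two input arrays would typically come from the hidden states returned when `output_hidden_states=True` is set.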
@article{yan2021consert,
title={ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer},
author={Yan, Yuanmeng and Li, Rumei and Wang, Sirui and Zhang, Fuzheng and Wu, Wei and Xu, Weiran},
journal={arXiv preprint arXiv:2105.11741},
year={2021}
}