Thai-Sentence-Vector-Benchmark

Benchmark for Thai sentence representation on Thai STS-B, text classification, and retrieval datasets.

Motivation

Sentence representation plays a crucial role in downstream NLP tasks such as NLI, text classification, and STS. Recent sentence representation training techniques require NLI or STS datasets. However, no equivalent Thai NLI or STS datasets exist for sentence representation training. To address this problem, we created the Thai sentence vector benchmark to demonstrate that Thai sentence representations can be trained without any supervised datasets.

Our preliminary results demonstrate that a robust sentence representation model can be trained with an unsupervised technique called SimCSE. We show that SimCSE can be trained on 1.3M sentences from Wikipedia within 2 hours on Google Colab (V100), and that SimCSE-XLM-R performs similarly to mDistil-BERT<-mUSE (trained on more than 1B sentences).

Moreover, we provide the Thai sentence vector benchmark, which evaluates the effectiveness of sentence embedding models on Thai zero-shot and transfer learning tasks. The benchmark comprises four tasks: semantic ranking on STS-B, text classification (transfer), pair classification, and retrieval question answering (QA).

How do we train unsupervised sentence representation?

We provide simple and effective sentence embedding methods that do not require supervised labels (unsupervised learning) as follows:

SimCSE

We use SimCSE: Simple Contrastive Learning of Sentence Embeddings on multilingual LMs (mBERT, distil-mBERT, XLM-R) and a monolingual model (WangchanBERTa).
Training data: Thai Wikipedia.
Example: SimCSE-Thai.ipynb.
Training Example on Google Colab: https://colab.research.google.com/github/mrpeerat/Thai-Sentence-Vector-Benchmark/blob/main/SimCSE-Thai.ipynb
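
The contrastive objective behind unsupervised SimCSE can be sketched in a few lines. In the SimCSE paper, each sentence is encoded twice with different dropout masks; the two encodings form a positive pair, and the other sentences in the batch act as negatives. The snippet below is a minimal pure-Python illustration, not the code from SimCSE-Thai.ipynb: the function names, the toy vectors, and the temperature of 0.05 (the default reported in the SimCSE paper) are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    view_a[i] and view_b[i] stand for two encodings of sentence i
    (for example, two forward passes with different dropout masks).
    Every view_b[j] with j != i is treated as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_denominator - logits[i]
    return total / len(view_a)


# Toy batch of two "sentences" with hypothetical 2-d embeddings.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(view_a, view_b)         # positives match their anchors
shuffled = simcse_loss(view_a, view_b[::-1])  # positives swapped
print(f"aligned loss:  {aligned:.4f}")
print(f"shuffled loss: {shuffled:.4f}")
```

The loss is low when each anchor is closest to its own second view and high when the views are mismatched, which is the signal the encoder is trained on.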

ConGen

We use the training objective from ConGen on various PLMs.
Training data: scb-mt-en-th-2020
Example: ConGen-Thai.ipynb

SCT

We use the training objective from SCT on various PLMs.
Training data: scb-mt-en-th-2020
Example: SCT-Thai.ipynb

Why do we select these techniques?

Easy to train
Compatible with every model
Does not require any annotated datasets
Currently the best-performing sentence representation methods on STS and downstream tasks (SCT outperforms ConGen and SimCSE in the SCT paper).

What about other techniques?

We also consider other techniques (supervised and unsupervised methods) in this repository. Currently, we have various methods tested on our benchmarks, such as:

Supervised learning: sentence-bert.
Multilingual sentence representation alignment: CL-ReLKT (NAACL'22)

Thai semantic textual similarity benchmark

We use a translated version of STS-B: we translated STS-B from SentEval into Thai using the Google Translate API.
How to evaluate sentence representation: Easy_Evaluation.ipynb
How to evaluate sentence representation on Google Colab: https://colab.research.google.com/github/mrpeerat/Thai-Sentence-Vector-Benchmark/blob/main/SentEval.ipynb
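
As a rough illustration of the metric reported in the table, the sketch below scores each sentence pair by the cosine similarity of its two embeddings and then computes Spearman's rank correlation against the gold similarity scores. It is a pure-Python sketch under stated assumptions, not the code in Easy_Evaluation.ipynb: the helper names, the toy 2-d embeddings, and the gold scores are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical embeddings for four sentence pairs and their gold scores (0 to 5).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.2, 1.0]),
]
gold_scores = [4.8, 2.5, 1.0, 0.4]

predicted = [cosine(a, b) for a, b in pairs]
rho = spearman(predicted, gold_scores)
print(f"Spearman's correlation (*100): {rho * 100:.2f}")
```

Because Spearman's correlation depends only on ranks, a model is rewarded for ordering pairs correctly even if its cosine values are not on the same scale as the gold scores.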

Base Model	Spearman's Correlation (*100)	Supervised?	Latency(ms)
simcse-model-distil-m-bert	44.27		7.22 ± 0.53
simcse-model-m-bert-thai-cased	43.95		11.66 ± 0.72
simcse-model-XLMR	63.98		10.95 ± 0.41
simcse-model-wangchanberta	60.95		10.54 ± 0.33
simcse-model-phayathaibert	68.28		11.4 ± 1.01
SCT-model-XLMR	68.90		10.52 ± 0.46
SCT-model-wangchanberta	71.35		10.61 ± 0.62
SCT-model-phayathaibert	74.06		10.64 ± 0.72
SCT-Distil-model-XLMR	78.78		10.69 ± 0.48
SCT-Distil-model-wangchanberta	77.77		10.86 ± 0.55
SCT-Distil-model-phayathaibert	77.89		11.01 ± 0.62
SCT-Distil-model-phayathaibert-bge-m3	76.71
ConGen-model-XLMR	79.69		10.79 ± 0.38
ConGen-model-wangchanberta	79.20		10.44 ± 0.5
ConGen-model-phayathaibert	78.90		10.32 ± 0.31
ConGen-BGE_M3-model-phayathaibert	76.82		10.91 ± 0.43
distiluse-base-multilingual-cased-v2	65.37	✔️	9.38 ± 1.34
paraphrase-multilingual-mpnet-base-v2	80.49	✔️	10.93 ± 0.55
BGE M-3	77.22	✔️	23.5 ± 3.07
Cohere-embed-multilingual-v2.0	68.03	✔️

Thai transfer benchmark

We use the Wisesight, Wongnai, and Generated Review datasets.
How to evaluate: Transfer_Evaluation
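
The two metrics reported in the transfer tables can be sketched as follows, given gold labels and the labels predicted by a classifier. This is a minimal illustration, not the Transfer_Evaluation code: it assumes that "weighted" F1 means per-class F1 averaged by class support (the same convention as scikit-learn's `average="weighted"`), and the label names and predictions are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(gold)
    total = 0.0
    for label, count in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = count - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += f1 * count
    return total / len(gold)


# Hypothetical sentiment labels and classifier predictions.
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]

print(f"Acc (*100): {accuracy(gold, pred) * 100:.2f}")
print(f"F1 (*100, weighted): {weighted_f1(gold, pred) * 100:.2f}")
```

Weighting by support keeps frequent classes from being drowned out by rare ones, which matters when the label distribution is imbalanced.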

Wisesight

Base Model	Acc (*100)	F1 (*100, weighted)	Supervised?
simcse-model-distil-m-bert	56.12	56.60
simcse-model-m-bert-thai-cased	55.86	56.65
simcse-model-XLMR	62.07	62.76
simcse-model-wangchanberta	64.17	64.39
simcse-model-phayathaibert	68.59	67.73
SCT-model-XLMR	67.47	67.62
SCT-model-wangchanberta	68.51	68.97
SCT-model-phayathaibert	70.80	68.60
SCT-Distil-model-XLMR	67.73	67.75
SCT-Distil-model-wangchanberta	65.78	66.17
SCT-Distil-model-phayathaibert	66.64	66.94
SCT-Distil-model-phayathaibert-bge-m3	67.28	67.70
ConGen-model-XLMR	66.75	67.41
ConGen-model-wangchanberta	67.09	67.65
ConGen-model-phayathaibert	67.65	68.12
ConGen-BGE_M3-model-phayathaibert	68.62	68.92
distiluse-base-multilingual-cased-v2	63.31	63.74	✔️
paraphrase-multilingual-mpnet-base-v2	67.05	67.67	✔️
BGE M-3	68.36	68.92	✔️
Cohere-embed-multilingual-v2.0	67.13	67.53	✔️

Wongnai

Base Model	Acc (*100)	F1 (*100, weighted)	Supervised?
simcse-model-distil-m-bert	34.31	35.81
simcse-model-m-bert-thai-cased	37.55	38.29
simcse-model-XLMR	40.46	38.06
simcse-model-wangchanberta	40.95	37.58
simcse-model-phayathaibert	37.53	38.45
SCT-model-XLMR	42.88	44.75
SCT-model-wangchanberta	47.90	47.23
SCT-model-phayathaibert	54.73	49.48
SCT-Distil-model-XLMR	46.16	47.02
SCT-Distil-model-wangchanberta	48.61	44.89
SCT-Distil-model-phayathaibert	48.86	48.14
SCT-Distil-model-phayathaibert-bge-m3	45.95	47.29
ConGen-model-XLMR	44.95	46.57
ConGen-model-wangchanberta	46.72	48.04
ConGen-model-phayathaibert	45.99	47.54
ConGen-BGE_M3-model-phayathaibert	47.98	49.22
distiluse-base-multilingual-cased-v2	37.76	40.07	✔️
paraphrase-multilingual-mpnet-base-v2	45.20	46.72	✔️
BGE M-3	51.94	52.68	✔️
Cohere-embed-multilingual-v2.0	xx.xx	xx.xx	✔️

Generated Review

Base Model	Acc (*100)	F1 (*100, weighted)	Supervised?
simcse-model-distil-m-bert	39.11	37.27
simcse-model-m-bert-thai-cased	38.72	37.56
simcse-model-XLMR	46.27	44.22
simcse-model-wangchanberta	37.37	36.72
simcse-model-phayathaibert	48.76	45.14
SCT-model-XLMR	55.93	54.19
SCT-model-wangchanberta	50.39	48.65
SCT-model-phayathaibert	54.90	48.36
SCT-Distil-model-XLMR	56.76	55.50
SCT-Distil-model-wangchanberta	52.33	48.41
SCT-Distil-model-phayathaibert	54.35	52.23
SCT-Distil-model-phayathaibert-bge-m3	58.95	57.64
ConGen-model-XLMR	57.93	56.66
ConGen-model-wangchanberta	58.67	57.51
ConGen-model-phayathaibert	58.43	57.23
ConGen-BGE_M3-model-phayathaibert	59.66	58.37
distiluse-base-multilingual-cased-v2	50.62	48.90	✔️
paraphrase-multilingual-mpnet-base-v2	57.48	56.35	✔️
BGE M-3	59.53	58.35	✔️
Cohere-embed-multilingual-v2.0	xx.xx	xx.xx	✔️

Thai pair classification benchmark

We use the XNLI dev and test sets. We drop the neutral class and map the remaining labels to binary values: contradiction => 0 and entailment => 1.
We use the average precision score as the main metric.
How to evaluate: XNLI_evaluation.ipynb
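
The label mapping and the metric described above can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the code in XNLI_evaluation.ipynb: it assumes each pair receives one similarity score (for example, the cosine similarity of the two sentence embeddings), and the function names, placeholder sentences, and scores are hypothetical.

```python
def to_binary_pairs(examples):
    """Drop neutral pairs; map contradiction to 0 and entailment to 1."""
    label_map = {"contradiction": 0, "entailment": 1}
    return [
        (premise, hypothesis, label_map[label])
        for premise, hypothesis, label in examples
        if label in label_map
    ]


def average_precision(labels, scores):
    """Average precision of binary labels ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Hypothetical XNLI-style examples (placeholder strings instead of Thai text).
examples = [
    ("premise 1", "hypothesis 1", "entailment"),
    ("premise 2", "hypothesis 2", "neutral"),
    ("premise 3", "hypothesis 3", "contradiction"),
]
binary = to_binary_pairs(examples)
print([label for _, _, label in binary])

# Hypothetical similarity scores for four binary-labelled pairs.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
print(f"AP (*100): {ap * 100:.2f}")
```

Average precision needs no decision threshold: it only asks whether entailment pairs receive higher similarity scores than contradiction pairs.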

Base Model	Dev (AP)	Test (AP)	Supervised?
simcse-model-distil-m-bert	57.99	56.06
simcse-model-m-bert-thai-cased	58.41	58.09
simcse-model-XLMR	62.05	62.05
simcse-model-wangchanberta	58.13	59.01
simcse-model-phayathaibert	62.10	63.34
SCT-model-XLMR	64.53	65.29
SCT-model-wangchanberta	66.36	66.79
SCT-model-phayathaibert	65.35	65.84
SCT-Distil-model-XLMR	78.40	79.14
SCT-Distil-model-wangchanberta	77.06	76.75
SCT-Distil-model-phayathaibert	77.95	77.61
SCT-Distil-model-phayathaibert-bge-m3	75.18	74.83
ConGen-model-XLMR	80.68	80.98
ConGen-model-wangchanberta	82.24	81.15
ConGen-model-phayathaibert	80.89	80.51
ConGen-BGE_M3-model-phayathaibert	76.72	76.13
distiluse-base-multilingual-cased-v2	65.35	64.93	✔️
paraphrase-multilingual-mpnet-base-v2	84.14	84.06	✔️
BGE M-3	79.09	79.02	✔️
Cohere-embed-multilingual-v2.0	60.25	61.15	✔️

Thai retrieval benchmark

We use the XQuAD, MIRACL, and TyDiQA datasets.
How to evaluate: Retrieval_Evaluation
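
The two retrieval metrics in the tables, R@1 and MRR@10, can be sketched as follows. This is a minimal illustration, not the Retrieval_Evaluation code: it assumes each question has exactly one gold passage and that the candidate passages have already been ranked (for example, by cosine similarity between question and passage embeddings). The function names and document IDs are hypothetical.

```python
def recall_at_1(rankings, gold_ids):
    """Fraction of questions whose top-ranked passage is the gold passage."""
    hits = sum(
        1 for ranked, gold in zip(rankings, gold_ids) if ranked and ranked[0] == gold
    )
    return hits / len(gold_ids)


def mrr_at_k(rankings, gold_ids, k=10):
    """Mean reciprocal rank of the gold passage within the top k results.

    A question contributes 0 if its gold passage is not in the top k.
    """
    total = 0.0
    for ranked, gold in zip(rankings, gold_ids):
        for position, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == gold:
                total += 1.0 / position
                break
    return total / len(gold_ids)


# Hypothetical ranked passage IDs for three questions, best match first.
rankings = [
    ["d1", "d2", "d3"],
    ["d3", "d2", "d1"],
    ["d2", "d1", "d3"],
]
gold_ids = ["d1", "d2", "d3"]

print(f"R@1: {recall_at_1(rankings, gold_ids) * 100:.2f}")
print(f"MRR@10: {mrr_at_k(rankings, gold_ids, k=10) * 100:.2f}")
```

R@1 only credits an exact top hit, while MRR@10 gives partial credit when the gold passage appears lower in the top ten, which is why MRR@10 is never lower than R@1 in the tables.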

XQuAD

Base Model	R@1	MRR@10	Supervised?	Latency(second)
simcse-model-distil-m-bert	18.24	27.19		0.61
simcse-model-m-bert-thai-cased	22.94	30.29		1.02
simcse-model-XLMR	52.02	62.94		0.85
simcse-model-wangchanberta	53.87	65.51		0.81
simcse-model-phayathaibert	73.95	81.67		0.79
SCT-model-XLMR	55.29	65.23		1.24
SCT-model-wangchanberta	66.30	76.14		1.23
SCT-model-phayathaibert	67.56	76.14		1.19
SCT-Distil-model-XLMR	68.91	78.19		1.24
SCT-Distil-model-wangchanberta	62.27	72.53		1.35
SCT-Distil-model-phayathaibert	71.43	80.18		1.21
SCT-Distil-model-phayathaibert-bge-m3	80.50	86.75
ConGen-model-XLMR	71.76	80.01		1.24
ConGen-model-wangchanberta	70.92	79.59		1.21
ConGen-model-phayathaibert	71.85	80.33		1.19
ConGen-BGE_M3-model-phayathaibert	85.80	90.48		1.3
distiluse-base-multilingual-cased-v2	49.16	58.19	✔️	1.05
paraphrase-multilingual-mpnet-base-v2	71.26	79.63	✔️	1.24
BGE M-3	90.50	94.33	✔️	7.22
Cohere-embed-multilingual-v2.0	82.52	87.78	✔️	XXX

MIRACL

Base Model	R@1	MRR@10	Supervised?	Latency(second)
simcse-model-distil-m-bert	28.51	37.05		4.31
simcse-model-m-bert-thai-cased	26.19	36.11		6.66
simcse-model-XLMR	34.92	47.51		6.17
simcse-model-wangchanberta	36.29	48.96		6.09
simcse-model-phayathaibert	43.25	57.28		6.18
SCT-model-XLMR	28.51	40.84		16.29
SCT-model-wangchanberta	35.33	48.19		16.0
SCT-model-phayathaibert	37.52	51.02		15.8
SCT-Distil-model-XLMR	40.38	51.68		16.17
SCT-Distil-model-wangchanberta	39.43	50.61		16.04
SCT-Distil-model-phayathaibert	45.16	56.52		15.82
SCT-Distil-model-phayathaibert-bge-m3	64.80	74.46
ConGen-model-XLMR	43.11	55.51		16.4
ConGen-model-wangchanberta	41.06	53.31		15.98
ConGen-model-phayathaibert	44.34	55.77		15.97
ConGen-BGE_M3-model-phayathaibert	70.40	79.33		15.83
distiluse-base-multilingual-cased-v2	17.74	27.78	✔️	9.84
paraphrase-multilingual-mpnet-base-v2	38.20	49.65	✔️	16.22
BGE M-3	79.67	86.68	✔️	91.27
Cohere-embed-multilingual-v2.0	66.98	77.58	✔️	XXX

TyDiQA

Base Model	R@1	MRR@10	Supervised?	Latency(second)
simcse-model-distil-m-bert	44.69	51.39		1.6
simcse-model-m-bert-thai-cased	45.09	52.37		2.46
simcse-model-XLMR	58.06	64.72		2.35
simcse-model-wangchanberta	62.65	70.02		2.32
simcse-model-phayathaibert	71.43	78.16		2.28
SCT-model-XLMR	49.28	58.62		3.15
SCT-model-wangchanberta	58.19	68.05		3.21
SCT-model-phayathaibert	63.43	71.73		3.21
SCT-Distil-model-XLMR	56.36	65.18		3.3
SCT-Distil-model-wangchanberta	56.23	65.18		3.18
SCT-Distil-model-phayathaibert	58.32	67.42		3.21
SCT-Distil-model-phayathaibert-bge-m3	78.37	84.01
ConGen-model-XLMR	60.29	68.56		3.28
ConGen-model-wangchanberta	59.11	67.42		3.19
ConGen-model-phayathaibert	59.24	67.69		3.15
ConGen-BGE_M3-model-phayathaibert	83.36	88.29		3.14
distiluse-base-multilingual-cased-v2	32.50	42.20	✔️	2.05
paraphrase-multilingual-mpnet-base-v2	54.39	63.12	✔️	3.16
BGE M-3	89.12	93.43	✔️	20.87
Cohere-embed-multilingual-v2.0	85.45	90.33	✔️	XXX

Thank you for the many codes from

Acknowledgments:

Can: proofread
Charin: proofread + idea
