# Korean SentenceBERT: Sentence Embeddings using Siamese BERT-Networks with ETRI KoBERT and the kakaobrain KorNLU dataset
- ETRI KorBERT runs only on transformers 2.4.1 ~ 2.8.0, while Sentence-BERT requires transformers 3.1.0 or higher, so the library code was modified to bridge the gap.
- Because the huggingface transformers, sentence-transformers, and tokenizers library code is patched directly, using a virtual environment is recommended.
- The Docker image used for this work is published on Docker Hub (a pull/run sketch follows this list).
- The model was trained with ETRI KoBERT; this repository does not distribute ETRI KoBERT itself.
- A version built on SKT KoBERT is available in a separate repository.
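A hypothetical example of pulling and running such an image; the actual image name must be taken from the Docker Hub page, and `--gpus all` assumes Docker 19.03+ with the NVIDIA container toolkit:

```bash
# Placeholder image name; substitute the one published on Docker Hub.
docker pull <dockerhub-user>/kosentencebert
docker run -it --gpus all -v "$(pwd)":/workspace <dockerhub-user>/kosentencebert
```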
```bash
git clone https://github.com/BM-K/KoSentenceBERT.git
python -m venv .KoSBERT
. .KoSBERT/bin/activate
pip install -r requirements.txt
```
- Move the repository's patched transformer, tokenizers, and sentence_transformers directories into .KoSBERT/lib/python3.7/site-packages/, for example:
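A sketch, run from the repository root (directory names assumed to match the list above; adjust the Python version in the path to your venv):

```bash
mv transformer tokenizers sentence_transformers .KoSBERT/lib/python3.7/site-packages/
```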
- The ETRI_KoBERT model and tokenizer must be present inside the KoSentenceBERT directory.
- The ETRI model and tokenizer are loaded as in the following example:
```python
# Snippet from the patched libraries (note the `self.`): these lines run
# inside a model class, and BertModel comes from the patched transformers.
from ETRI_tok.tokenization_etri_eojeol import BertTokenizer

self.auto_model = BertModel.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch')
self.tokenizer = BertTokenizer.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch/vocab.txt', do_lower_case=False)
```
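For reference, `BertModel.from_pretrained` expects a standard HuggingFace-style PyTorch checkpoint directory; the exact file list shipped by ETRI may differ, but it typically looks like:

```
ETRI_KoBERT/003_bert_eojeol_pytorch/
├── config.json
├── pytorch_model.bin
└── vocab.txt
```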
- To train a model, the KorNLUDatasets directory must be present inside the KoSentenceBERT directory.
- For STS training, the data was adapted to fit the model structure; the data and the training commands are as follows:

`KoSentenceBERT/KorNLUDatasets/KorSTS/tune_test.tsv` (a subset of the STS test set)
```bash
python training_nli.py      # train on NLI data only
python training_sts.py      # train on STS data only
python con_training_sts.py  # train on NLI data, then fine-tune on STS data
```
The pooling mode is the MEAN strategy, and trained models are saved under the output directory.
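For reference, with the stock sentence-transformers API a BERT encoder with MEAN pooling is assembled roughly as below; the bundled scripts wire in the ETRI tokenizer via the patched libraries, so treat this as a sketch rather than the repository's exact code:

```python
from sentence_transformers import SentenceTransformer, models

# Token-level encoder (checkpoint path as used elsewhere in this README)
word_embedding_model = models.Transformer('./ETRI_KoBERT/003_bert_eojeol_pytorch')

# MEAN-strategy pooling over token embeddings
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```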
| Directory | Training method |
|---|---|
| training_nli_ETRI_KoBERT-003_bert_eojeol | Only Train NLI |
| training_sts_ETRI_KoBERT-003_bert_eojeol | Only Train STS |
| training_nli_sts_ETRI_KoBERT-003_bert_eojeol | STS + NLI |
Results on the KorSTS test set (seed fixed):
| Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|
| NLI | 67.96 | 70.45 | 71.06 | 70.48 | 71.17 | 70.51 | 64.87 | 63.04 |
| STS | 80.43 | 79.99 | 78.18 | 78.03 | 78.13 | 77.99 | 73.73 | 73.40 |
| STS + NLI | 80.10 | 80.42 | 79.14 | 79.28 | 79.08 | 79.22 | 74.46 | 74.16 |
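The eight columns correspond to the Pearson and Spearman correlations that sentence-transformers' `EmbeddingSimilarityEvaluator` reports over cosine, Euclidean, Manhattan, and dot-product similarities. A minimal sketch, assuming the list-based constructor of recent sentence-transformers releases, with toy data in place of the KorSTS test split:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('./output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol')

# In practice the pairs and gold scores come from KorSTS (tune_test.tsv);
# gold scores are normalized from [0, 5] to [0, 1]. Toy placeholders here.
sentences1 = ['한 남자가 음식을 먹는다.', '한 남자가 말을 탄다.']
sentences2 = ['한 남자가 빵 한 조각을 먹는다.', '한 여자가 바이올린을 연주한다.']
gold_scores = [4.2 / 5.0, 0.5 / 5.0]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
evaluator(model, output_path='./output/')  # writes a CSV with all eight metrics
```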
- Below are a few examples of how the generated sentence embeddings can be used in downstream applications.
- All examples use the STS + NLI pretrained model.

SemanticSearch.py is an example of finding the sentences in a corpus that are most similar to a given query sentence. First, embeddings are generated for every sentence in the corpus.
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'
embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # We use np.argpartition to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
```
The results are as follows:
```
======================

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.7557)
한 남자가 빵 한 조각을 먹는다. (Score: 0.6464)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.2565)
한 남자가 말을 탄다. (Score: 0.2333)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1792)

======================

Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6732)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.3401)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1037)
한 남자가 음식을 먹는다. (Score: 0.0617)
그 여자가 아이를 돌본다. (Score: 0.0466)

======================

Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7164)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3216)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.2071)
한 남자가 빵 한 조각을 먹는다. (Score: 0.1089)
한 남자가 음식을 먹는다. (Score: 0.0724)
```
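As an aside, newer sentence-transformers releases bundle a helper that performs the scoring and top-k selection in one call; a sketch, assuming `util.semantic_search` is available in the version in use:

```python
# Each entry of hits is a list of {'corpus_id': ..., 'score': ...} dicts,
# sorted by decreasing cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], "(Score: %.4f)" % hit['score'])
```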
Clustering.py shows an example of clustering similar sentences by sentence-embedding similarity. As before, embeddings are first computed for every sentence.
```python
from sentence_transformers import SentenceTransformer

model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'
embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',
          '한 남자가 파스타를 먹는다.',
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
          '치타가 들판을 가로 질러 먹이를 쫓는다.']
corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

# Group sentences by their assigned cluster
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i + 1)
    print(cluster)
    print("")
```
The results are as follows:
```
Cluster 1
['두 남자가 수레를 숲 속으로 밀었다.', '치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster 2
['한 남자가 말을 탄다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster 3
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster 4
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']

Cluster 5
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']
```
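KMeans requires fixing num_clusters up front; when the number of clusters is unknown, hierarchical clustering with a distance cutoff is a common alternative. A sketch using scikit-learn on the corpus_embeddings from the example above (the threshold value is illustrative, not from the original script):

```python
from sklearn.cluster import AgglomerativeClustering

# No fixed k: merge clusters until the linkage distance exceeds the threshold.
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5)
clustering_model.fit(corpus_embeddings)
print(clustering_model.labels_)
```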
```bibtex
@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}
```
Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch
```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "http://arxiv.org/abs/1908.10084",
}

@article{reimers-2020-multilingual-sentence-bert,
  title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
  author = "Reimers, Nils and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2004.09813",
  month = "04",
  year = "2020",
  url = "http://arxiv.org/abs/2004.09813",
}
```