This repository contains the data, code, and models for our paper JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications. It is built on PyTorch and Hugging Face.
We propose JCSE (derived from "Contrastive learning of Sentence Embeddings for Japanese"), a novel Japanese sentence representation framework for domain adaptation that creates training data by generating sentences and synthesizing them with sentences available in a target domain. Specifically, a pre-trained data generator is fine-tuned to a target domain using our collected corpus. It is then used to generate contradictory sentence pairs, which are used in contrastive learning with a two-stage training recipe to adapt a Japanese language model to a specific task in the target domain.
We recommend the following dependencies.
- Python 3.8
- PyTorch 1.9
- transformers 4.22.2
- datasets 2.3.2
- spaCy 3.3.1
- GiNZA 5.1.2
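To compare an existing environment against the versions listed above, a small check along these lines may help. It is a minimal sketch that uses only the standard library; the PyPI distribution names (`torch`, `transformers`, `datasets`, `spacy`, `ginza`) and the prefix-matching rule are assumptions made for illustration, not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Recommended versions copied from the list above.
# Keys are assumed PyPI distribution names.
RECOMMENDED = {
    "torch": "1.9",
    "transformers": "4.22.2",
    "datasets": "2.3.2",
    "spacy": "3.3.1",
    "ginza": "5.1.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment(recommended=RECOMMENDED):
    """Map each package name to (installed, recommended, matches)."""
    report = {}
    for name, wanted in recommended.items():
        found = installed_version(name)
        # "1.9" should also accept "1.9.0" or "1.9.1+cu111".
        ok = found is not None and (found == wanted or found.startswith(wanted + "."))
        report[name] = (found, wanted, ok)
    return report


for name, (found, wanted, ok) in check_environment().items():
    status = "ok" if ok else "differs"
    print(f"{name}: installed={found} recommended={wanted} [{status}]")
```

A "differs" line is only a hint; other versions may work as well.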
Another problem of Japanese sentence representation learning is the difficulty of evaluating existing embedding methods due to the lack of benchmark datasets. Thus, we establish a comprehensive Japanese Semantic Textual Similarity (STS) benchmark on which various embedding models are evaluated.
We use the SentEval toolkit to evaluate embedding models on the Japanese STS benchmark, which is integrated into our published version of the SentEval toolkit.
You can evaluate any Japanese sentence embedding model with the following command:
python evaluation.py \
--model_name_or_path <your_model_dir> \
--pooler <cls|cls_before_pooler|avg|avg_top2|avg_first_last> \
--task_set <sts|transfer|full> \
--mode test
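For readers unfamiliar with how STS benchmarks are usually scored, the sketch below shows the common recipe: compute the cosine similarity of each sentence pair's embeddings, then report the Spearman correlation between those similarities and the gold scores. It is a pure-Python illustration under that assumption; the function names are hypothetical and it is not the code in `evaluation.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def sts_score(embeddings_a, embeddings_b, gold_scores):
    """Spearman correlation between pairwise cosine similarity and gold scores."""
    predicted = [cosine(a, b) for a, b in zip(embeddings_a, embeddings_b)]
    return spearman(predicted, gold_scores)
```

Rank correlation is used because only the ordering of similarities matters, not their absolute scale.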
The Japanese sentence embedding models we trained are listed below.
The Wikipedia data and JSNLI data for contrastive learning can be downloaded with the following commands:
wget https://huggingface.co/datasets/MU-Kindai/datasets-for-JCSE/resolve/main/wiki1m.txt
wget https://huggingface.co/datasets/MU-Kindai/datasets-for-JCSE/resolve/main/nli_for_simcse.csv
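If you prefer to fetch the files from Python, the sketch below does the same with the standard library. On the Hugging Face Hub, raw file contents are served from `/resolve/<revision>/` URLs, whereas `/blob/` URLs point at the HTML file viewer. The helper names `resolve_url` and `download` and the default `data` directory are assumptions made for illustration.

```python
import urllib.request
from pathlib import Path

REPO = "MU-Kindai/datasets-for-JCSE"
FILES = ["wiki1m.txt", "nli_for_simcse.csv"]


def resolve_url(filename, repo=REPO, revision="main"):
    """Build the raw-file URL for a file in a Hugging Face dataset repository."""
    return f"https://huggingface.co/datasets/{repo}/resolve/{revision}/{filename}"


def download(filename, target_dir="data"):
    """Download one file into target_dir, skipping files that already exist."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    path = target / filename
    if not path.exists():
        urllib.request.urlretrieve(resolve_url(filename), str(path))
    return path


# Example (requires network access):
# for name in FILES:
#     print(download(name))
```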
The target domain corpus used in our paper can be downloaded from here and here.
You can fine-tune the data generator by referring to this code.
You can generate contradictory data by referring to the code here and here.
You can also download the synthetic data for each target domain from the list below and use it directly for contrastive learning.
Synthetic Data |
---|
clinic_domain_top4 |
clinic_domain_top5 |
clinic_domain_top6 |
education_domain_top4 |
education_domain_top5 |
education_domain_top6 |
Run train.py. You can set the hyperparameters however you like.
In our experiments, we use different save strategies (saving by steps or by epochs) to keep multiple checkpoints, and then select the best one among them.
python train.py \
--model_name_or_path <your_model_dir> \
--train_file <data_dir> \
--output_dir <model_output_dir> \
--num_train_epochs <training_epoch> \
--per_device_train_batch_size 512 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--max_seq_length 32 \
--save_strategy steps \
--save_steps 125 \
--pooler_type cls \
--mlp_only_train \
--overwrite_output_dir \
--hard_negative_weight 1 \
--temp 0.05 \
--do_train
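With `--save_strategy steps`, several checkpoints end up in the output directory, and the best one has to be picked afterwards. The sketch below assumes the checkpoints are saved as `checkpoint-<step>` subdirectories (the Hugging Face `Trainer` convention, which is an assumption about `train.py`) and takes a user-supplied `score_fn`, a hypothetical callback that returns, for example, a development-set score for a checkpoint directory.

```python
import re
from pathlib import Path

# Assumed naming scheme: <output_dir>/checkpoint-<global_step>
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(output_dir):
    """Return checkpoint directories sorted by training step."""
    found = []
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    found.sort(key=lambda item: item[0])
    return [path for _, path in found]


def best_checkpoint(output_dir, score_fn):
    """Return the checkpoint directory with the highest score_fn(path)."""
    checkpoints = list_checkpoints(output_dir)
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoint-* directories in {output_dir}")
    return max(checkpoints, key=score_fn)
```

In practice `score_fn` could wrap a call to `evaluation.py` on a development set and parse the reported score; how that score is obtained is left open here.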
Arguments used to train our models:
Method | Arguments |
---|---|
MU-Kindai/JCSE-clinic-stage1-base | --train_file clinic_shuffle_for_simcse_top4.csv --learning_rate 5e-5 --hard_negative_weight 0 |
MU-Kindai/JCSE-clinic-final-base | --train_file nli_for_simcse.csv --learning_rate 5e-5 --hard_negative_weight 1 |
MU-Kindai/JCSE-clinic-stage1-large | --train_file clinic_shuffle_for_simcse_top5.csv --learning_rate 1e-5 --hard_negative_weight 0 |
MU-Kindai/JCSE-clinic-final-large | --train_file nli_for_simcse.csv --learning_rate 1e-5 --hard_negative_weight 1 |
MU-Kindai/JCSE-edu-stage1-base | --train_file qa_shuffle_for_simcse_top4.csv --learning_rate 5e-5 --hard_negative_weight 0 |
MU-Kindai/JCSE-edu-final-base | --train_file nli_for_simcse.csv --learning_rate 5e-5 --hard_negative_weight 1 |
MU-Kindai/JCSE-edu-stage1-large | --train_file qa_shuffle_for_simcse_top6.csv --learning_rate 1e-5 --hard_negative_weight 0 |
MU-Kindai/JCSE-edu-final-large | --train_file nli_for_simcse.csv --learning_rate 1e-5 --hard_negative_weight 1 |
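To make the roles of `--temp` and `--hard_negative_weight` more concrete, here is a pure-Python sketch of an in-batch contrastive (InfoNCE-style) loss. It assumes the SimCSE convention, in which the weight is added to the logit of each anchor's own hard negative, so a value of 0 leaves that logit unchanged. Whether `train.py` handles the flag in exactly this way is an assumption; the function and argument names are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, hard_negatives=None,
                     temp=0.05, hard_negative_weight=0.0):
    """Mean cross-entropy over a batch; positives[i] is the target for anchors[i].

    Every other positive in the batch acts as a negative. If hard_negatives
    is given, all of them are added as extra negatives, and the anchor's own
    hard negative receives an additive logit bonus (assumed convention).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temp for p in positives]
        if hard_negatives is not None:
            for j, negative in enumerate(hard_negatives):
                logit = cosine(anchor, negative) / temp
                if j == i:
                    logit += hard_negative_weight
                logits.append(logit)
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)
```

A lower temperature sharpens the softmax, so small differences in cosine similarity lead to larger differences in the loss.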
For the clinic domain STS tasks in our paper, you can evaluate the embedding models with the following command:
python evaluation.py \
--model_name_or_path <your_model_dir> \
--pooler avg \
--task_set transfer \
--mode test
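The `--pooler` option selects how token-level hidden states are reduced to a single sentence vector. As an illustration, the sketch below contrasts taking the first token's vector with averaging over non-padding tokens. It works on plain lists for one sentence and assumes that `cls` and `avg` follow the usual SimCSE meaning; the real implementation in `evaluation.py` operates on batched tensors and may differ in detail.

```python
def cls_pool(hidden_states, attention_mask):
    """Use the first token's vector (the [CLS] position) as the embedding."""
    return list(hidden_states[0])


def avg_pool(hidden_states, attention_mask):
    """Average the vectors of tokens whose attention mask is 1."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# One sentence with three token positions, the last of which is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(cls_pool(hidden, mask))
print(avg_pool(hidden, mask))
```

Masking matters: without it, padding vectors would leak into the average and make embeddings depend on batch composition.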
For the education domain information retrieval tasks in our paper, you can evaluate the embedding models with the following commands:
cd QAbot_task_eva
python main.py \
--model_name_or_path <your_model_dir>
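Embedding-based retrieval generally ranks candidate texts by their similarity to a query embedding. The sketch below shows that idea with cosine similarity in pure Python. It is a generic illustration, not a description of `main.py`, whose exact procedure and metrics are not covered here; `rank_candidates` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs, k=None):
    """Return candidate indices ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), index)
              for index, vec in enumerate(candidate_vecs)]
    # Sort by similarity (descending); ties keep the lower index first.
    scored.sort(key=lambda item: (-item[0], item[1]))
    ordered = [index for _, index in scored]
    return ordered if k is None else ordered[:k]
```

For example, with a query embedding and the embeddings of stored questions, `rank_candidates(query, stored, k=5)` would give the indices of the five closest stored questions.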
For the relevant content words experiments in our paper, refer to the code and examples here.
If this work is helpful, please cite the following paper:
@article{Chen2023JSCE,
author={Chen, Zihao and Handa, Hisashi and Shirahama, Kimiaki},
title={JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications},
journal={arXiv e-prints 10.48550/arXiv.2301.08193},
year={2023},
}