Paraphrase generation and detection are important tasks in Natural Language Processing (NLP), with applications in information retrieval, text simplification, question answering, and chatbots. The lack of comprehensive Persian paraphrase datasets is a major obstacle to progress in this area. Despite their importance, no large-scale corpus has been released so far, given the difficulty and intensive labor involved in creating one. In this paper, we describe the construction of Degarbayan-SC from movie subtitles. Movie subtitles are written in colloquial language, which differs from formal Persian. To the best of our knowledge, Degarbayan-SC is the first freely released large-scale (on the order of a million words) Persian paraphrase corpus. We believe this newly introduced dataset will help the growth of Persian paraphrase research.
Dataset | Number of sentence pairs | Release date | Details |
---|---|---|---|
PPDB | 100M | Version 1: 2013, Version 2: 2015 | Phrasal paraphrases extracted via bilingual pivoting |
WikiAnswers | 18M | 2014 | Question pairs that users of the WikiAnswers website marked as similar |
MSCOCO | 500K | 2014 | Based on annotations of 238K photos in 91 classes by 5 people |
QQP | 400K | 2017 | Based on the Kaggle competition on identifying similar questions |
ParaNMT-50M | 50M | 2018 | Sentential paraphrase pairs generated automatically with neural machine translation |
Degarbayan-SC (ours) | 1.5M | 2022 | Based on aligning sentences across hundreds of movie subtitles |
You can download the dataset from this Google Drive link.
- The dataset is in .csv format.
- It has two columns: the first contains the source sentences and the second contains the target (paraphrase) sentences; see the loading sketch below.
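For example, the CSV can be loaded with pandas. This is a minimal sketch; the file name `Degarbayan-SC.csv` is an assumption, so adjust it to the file you downloaded from Google Drive:

```python
import pandas as pd

# Load the paraphrase pairs (file name is an assumption).
df = pd.read_csv("Degarbayan-SC.csv")

# The first column holds source sentences, the second the target paraphrases.
sources = df.iloc[:, 0]
targets = df.iloc[:, 1]

print(len(df), "paraphrase pairs")
print(sources.iloc[0], "->", targets.iloc[0])
```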
Alternatively, you can access the data through the HuggingFace🤗 datasets library. For that, install datasets with this command in your terminal (we will share the dataset on HuggingFace after our paper is published):
pip install -q datasets
Afterward, load the Degarbayan-SC dataset using load_dataset:
# Load the Degarbayan-SC paraphrase corpus from the HuggingFace Hub
from datasets import load_dataset
dataset = load_dataset("m0javad/Degarbayan-SC-dataset")
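A quick sanity check after loading; a minimal sketch in which the `"train"` split name and the column names are assumptions, so print the object to confirm the actual structure:

```python
# Show the available splits, column names, and number of rows.
print(dataset)

# Look at one example pair (split name "train" is an assumption).
example = dataset["train"][0]
print(example)
```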
Or you can fine-tune the model using transformers:
# Load the paraphrase generation model and its tokenizer directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("m0javad/Degarbayan-SC")
# Paraphrase generation is a text-to-text task, so the model is loaded as a seq2seq model
model = AutoModelForSeq2SeqLM.from_pretrained("m0javad/Degarbayan-SC")
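Once the model and tokenizer are loaded, a paraphrase can be generated directly with `generate`. This is a minimal sketch; the generation parameters (`max_length`, `num_beams`) are illustrative choices, not values from the paper:

```python
# Tokenize a colloquial Persian sentence (taken from the examples table below)
# and generate a paraphrase.
inputs = tokenizer("خب انگار که داریم انجامش میدیم", return_tensors="pt")
outputs = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```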
Or you can test the fine-tuned model using pipeline:
# Build a text-to-text generation pipeline with the fine-tuned paraphrase model
from transformers import pipeline
pipe = pipeline("text2text-generation", model="m0javad/Degarbayan-SC")
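For example, the pipeline can be called on one of the colloquial sentences from the examples table below; the exact output text depends on the fine-tuned model:

```python
# Generate a paraphrase; text2text-generation pipelines return a list of
# dicts with a "generated_text" field.
result = pipe("بیدار شو باهات کار دارم")
print(result[0]["generated_text"])
```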
Sentence lengths range from 3 to 19 words, with an average of 8 words per sentence. This makes sense because subtitle lines are displayed for limited time spans, which we used to align them, and a person can only say so many words in a given period. The collected sentences contain 128,699 unique words.
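These statistics can be approximated from the CSV with a few lines of pandas. A minimal sketch; the file name and the simple whitespace tokenization are assumptions, so exact counts may differ slightly:

```python
import pandas as pd

df = pd.read_csv("Degarbayan-SC.csv")  # assumed file name

# Whitespace-tokenize all sentences from both columns.
sentences = pd.concat([df.iloc[:, 0], df.iloc[:, 1]]).astype(str)
lengths = sentences.str.split().str.len()

print("average length:", lengths.mean())                 # reported: ~8 words
print("min/max length:", lengths.min(), lengths.max())   # reported: 3 to 19
print("unique words:", len({w for s in sentences for w in s.split()}))  # reported: 128,699
```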
Source sentence | Target sentence |
---|---|
باقی زندگی بچه هاتو تغییر میده | این تصمیم قراره زندگی بچه هاتو عوض کنه |
خب انگار که داریم انجامش میدیم | فکر کنم داریم این کارو میکنیم |
به من بگو کی پشت این جریانه | میخوای به من بگی که کی پشت همه ایناست |
بیدار شو باهات کار دارم | از اون تو بیا بیرون رفیق بهت نیاز دارم |
به هر حال این سرزمین مال اونه | بااینکه این سرزمین متعلق به اونه |
باک ما الان همه با هم گره خوردیم | باک ما همگی بهم وصل هستیم |
As the table above shows, our dataset contains a large number of paraphrase pairs in various forms, including syntactic, semantic, and conceptual paraphrases.
Contact me for contributions and possible future work at: mjaghajani.ai@gmail.com
I would like to thank my dear teacher Dr. Keyvanrad and my colleagues Zahra Ghasemi and Ali Sadeghian.