Vietnamese Datasets

Keywords: Vietnamese datasets, Vietnamese corpora, Vietnamese corpus, Vietnamese resources.

UIT-ViQuAD (version 1.0) - A Vietnamese Dataset for Evaluating Machine Reading Comprehension. Bộ Dữ liệu Đọc hiểu Tự động cho Tiếng Việt.

Abstract: Over 97 million people worldwide speak Vietnamese as their native language. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for evaluating MRC models on a low-resource language such as Vietnamese. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages from 174 Vietnamese Wikipedia articles. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands single-sentence and multiple-sentence inferences. In addition, we conduct experiments with state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD. We also estimate human performance on the dataset and compare it to the experimental results of powerful machine learning models. The substantial difference between human performance and the best model performance on the dataset indicates that improvements can be made on UIT-ViQuAD in future research. Our dataset is freely available on our website to encourage the research community to overcome challenges in Vietnamese MRC.

Cross-lingual machine reading comprehension datasets: SQuAD (for English), UIT-ViQuAD (for Vietnamese), KorQuAD (for Korean), FQuAD (for French), and SberQuAD (for Russian).

Paper: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. A Vietnamese Dataset for Evaluating Machine Reading Comprehension. COLING 2020.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-ViNewsQA: New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Abstract: Large-scale, high-quality corpora are essential for evaluating machine reading comprehension models on low-resource languages such as Vietnamese. In addition, machine reading comprehension for the health domain offers great potential for practical applications; however, there is still very little machine reading comprehension research in this domain. In this study, we present UIT-ViNewsQA, a new corpus for the Vietnamese language to evaluate healthcare reading comprehension models. The corpus comprises 22,077 human-generated question-answer pairs. Crowd-workers create the questions and their answers based on a set of over 4,419 online Vietnamese healthcare news articles, where the answers are spans extracted from the corresponding articles. In particular, we develop a process of creating a corpus for Vietnamese machine reading comprehension. Comprehensive evaluations demonstrate that our corpus requires abilities beyond simple reasoning such as word matching, and demands more difficult reasoning such as inferences based on single-sentence or multiple-sentence information. We conduct experiments using state-of-the-art methods for machine reading comprehension to obtain the first baseline performance measures, against which future models can be compared. We measure human performance on the corpus and compare it with several strong neural network-based models. Our experiments show that the best model is BERT, which achieves an exact match score of 57.57% and an F1-score of 76.90% on our corpus. The significant difference between humans and the best model (F1-score of 15.93%) on the test set of our corpus indicates that improvements on UIT-ViNewsQA can be explored in future research. Our corpus is freely available on our website in order to encourage the research community to make these improvements.
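The exact match and F1 scores quoted above are the usual metrics for span-extraction reading comprehension. The following is a minimal sketch of how such scores are commonly computed for one prediction, not the authors' evaluation script. The normalization steps (lowercasing, stripping ASCII punctuation, whitespace tokenization) and the function names are assumptions made for illustration.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-overlap F1 between a predicted span and a gold span."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Two empty spans count as a match; one empty span counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction that misses the first word of the gold span.
print(exact_match("sốt xuất huyết", "bệnh sốt xuất huyết"))
print(round(f1_score("sốt xuất huyết", "bệnh sốt xuất huyết"), 3))
```

A corpus-level score would typically average these per-question values, taking the maximum over reference answers when a question has several.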

Paper: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

ViMMRC (version 1.0) - Vietnamese Multiple-choice Machine Reading Comprehension Corpus

Abstract: Machine Reading Comprehension (MRC) is a natural language processing task that studies the ability to read and understand unstructured texts and then find the correct answers to questions. Until now, no MRC dataset has existed for a low-resource language such as Vietnamese. In this paper, we introduce ViMMRC, a challenging machine comprehension corpus with multiple-choice questions, intended for research on the machine comprehension of Vietnamese text. This corpus includes 2,783 multiple-choice questions and answers based on a set of 417 Vietnamese texts used for teaching reading comprehension to 1st to 5th graders. Answers may be extracted from the contents of single or multiple sentences in the corresponding reading text. A thorough analysis of the corpus and the experimental results in this paper illustrate that ViMMRC demands reasoning abilities beyond simple word matching. We propose the Boosted Sliding Window (BSW) method, which improves accuracy by 5.51% over the best baseline method. We also measure human performance on the corpus and compare it to our MRC models. The performance gap between humans and our best experimental model indicates that significant progress can be made on Vietnamese machine reading comprehension in further research. The corpus is freely available on our website for research purposes.
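For readers unfamiliar with lexical baselines for multiple-choice reading comprehension, the sketch below shows the plain sliding-window baseline in the style introduced with the MCTest dataset. It is not the Boosted Sliding Window method from the paper, whose details are not given here. Inputs are assumed to be pre-tokenized lists; whitespace-separated syllables are used only for illustration, and the example passage, question, and options are hypothetical.

```python
import math
from collections import Counter


def sliding_window_score(passage, question, option):
    """Best inverse-count-weighted overlap of any passage window with question + option."""
    target = set(question) | set(option)
    counts = Counter(passage)
    # Rare passage words contribute more than frequent ones.
    weight = {w: math.log(1.0 + 1.0 / c) for w, c in counts.items()}
    size = len(target)
    best = 0.0
    for start in range(max(1, len(passage) - size + 1)):
        window = passage[start:start + size]
        score = sum(weight[w] for w in window if w in target)
        best = max(best, score)
    return best


def choose_option(passage, question, options):
    """Return the index of the highest-scoring answer option."""
    scores = [sliding_window_score(passage, question, o) for o in options]
    return scores.index(max(scores))


passage = "lan đi học bằng xe đạp mỗi sáng".split()
question = "lan đi học bằng gì".split()
options = ["xe buýt".split(), "xe đạp".split(), "đi bộ".split()]
print(choose_option(passage, question, options))  # index of the chosen option
```

Because this baseline only rewards word overlap, it fails on questions that need inference across sentences, which is the kind of question the abstract says the corpus contains.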

Paper: Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen, Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice reading comprehension.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-VSFC (version 1.0) - Vietnamese Students’ Feedback Corpus

Abstract: Students’ feedback is a vital resource for interdisciplinary research combining two fields: sentiment analysis and education. The Vietnamese Students’ Feedback Corpus (UIT-VSFC) is a resource consisting of over 16,000 sentences that are human-annotated for two different tasks: sentiment-based and topic-based classification. To assess the quality of our corpus, we measure annotator agreement and classification performance on the UIT-VSFC corpus. As a result, we obtained inter-annotator agreement of over 91% for sentiments and over 71% for topics. In addition, we built a baseline model with the Maximum Entropy classifier and achieved approximately 88% sentiment F1-score and over 84% topic F1-score.

Paper: Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan-Vinh Nguyen, Tham Thi-Hong Truong, Ngan Luu-Thuy Nguyen, UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis, 2018 10th International Conference on Knowledge and Systems Engineering (KSE 2018), November 1-3, 2018, Ho Chi Minh City, Vietnam.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-VSMEC (version 1.0) - Vietnamese Social Media Emotion Corpus

Abstract: Emotion recognition is a higher-level approach to, or special case of, sentiment analysis. In this task, the result is not expressed as a polarity (positive or negative) or as a rating (from 1 to 5), but at a more detailed level of sentiment analysis in which the results are expressed as emotions such as sadness, enjoyment, anger, disgust, fear and surprise. Emotion recognition plays a critical role in measuring the brand value of a product by recognizing specific emotions in customers’ comments. In this study, we have achieved two targets. First and foremost, we built a standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese, which is a low-resource language in Natural Language Processing (NLP). Secondly, we assessed and measured machine learning and deep neural network models on our UIT-VSMEC. As a result, the Convolutional Neural Network (CNN) model achieved the highest performance, with an F1-score of 57.61%.

Paper: Vong Ho, Duong Nguyen, Danh Nguyen, Linh Pham, Kiet Nguyen and Ngan Nguyen, Emotion Recognition for Vietnamese Social Media Text, 2019 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019), October 11-13, 2019, Ha Noi, Vietnam.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-ViIC (version 1.0) - Vietnamese Image Captioning Dataset

Abstract: Automatic generation of image captions has attracted attention in recent years from researchers in various fields of computer science such as computer vision, natural language processing and machine learning. This paper contributes to the image captioning problem by extending image captioning datasets to another language. In particular, we concentrate on generating Vietnamese captions for images, as no image captioning dataset exists for Vietnamese. We propose a dataset called UIT-ViIC, which was annotated manually in Vietnamese with images from the MS-COCO dataset. In addition, we built a web-based annotation tool to improve annotators' performance. UIT-ViIC in this scope consists of 19,250 captions for 3,850 images related to ball sports. UIT-ViIC is then evaluated with existing deep neural network models for image captioning. Our dataset in this scope will be published on our lab website for research purposes.

Paper: Quan Hoang Lam, Quang Duy Le, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-ViNames (version 1.0) - Vietnamese Name Dataset

Abstract: As biological gender is one of the aspects that characterize an individual, much work has been done on gender classification based on people's names. Proposals for English and Chinese are numerous; still, little work has been done for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders. This dataset is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest and Logistic Regression) and a deep learning model (LSTM) with fastText word embeddings for gender prediction on Vietnamese names. We create a dataset and investigate the impact of each name component on detecting gender. As a result, the best F1-score that we have achieved is up to 96% with the LSTM model, and we provide a web API based on our trained model.
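To make "name component" concrete: Vietnamese full names are conventionally written family name first, then any middle names, then the given name. The sketch below splits a name by position under that convention. It is an illustrative assumption, not the authors' preprocessing; the function name and example name are hypothetical, and compound family names are not handled.

```python
def split_vietnamese_name(full_name):
    """Split a full name into (family, middle, given) purely by word position."""
    parts = full_name.split()
    if not parts:
        raise ValueError("empty name")
    if len(parts) == 1:
        # A single word is treated as the given name.
        return "", "", parts[0]
    return parts[0], " ".join(parts[1:-1]), parts[-1]


family, middle, given = split_vietnamese_name("Nguyễn Văn An")
print(family, "|", middle, "|", given)
```

Each component could then be used as a separate feature, for example to compare a classifier trained on given names alone against one trained on full names.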

Paper: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, and Anh Gia-Tuan Nguyen. Gender Prediction Based on Vietnamese Names with Machine Learning Techniques.

Please contact us via email: huytq@uit.edu.vn (Mr. Huy To) to sign the corpus user agreement and then receive the corpus.

UIT-ViOCD: Vietnamese Open-domain Complaint Detection Dataset

Abstract: Customer product reviews play a role in improving the quality of products and services for organizations or brands. Complaining is an attitude that expresses dissatisfaction with an event or a product not meeting customer expectations. In this paper, we build a Vietnamese dataset (UIT-ViOCD) of 5,485 human-annotated product reviews from e-commerce sites, covering four categories. After the data collection phase, we proceed to the annotation task and achieve Am = 87% by Fleiss' Kappa. Then, we present an extensive set of methods for research purposes and achieve an F1-score of 92.16% for identifying complaints. Building on these results, we plan to build a system for open-domain complaint detection on e-commerce websites.
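The agreement figure above is reported using Fleiss' Kappa. As a minimal sketch, the standard Fleiss' kappa statistic can be computed from a table of per-item label counts as follows. How the authors' Am value is derived from it is not stated here, and the toy table (three annotators, two labels) is hypothetical.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of per-item category counts.

    table[i][j] is the number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators (at least 2).
    The statistic is undefined when all ratings fall in a single category.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(table[0])

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in table) / total for j in range(n_categories)]
    p_expected = sum(p * p for p in p_cat)

    return (p_observed - p_expected) / (1.0 - p_expected)


# Columns: [complaint, non-complaint]; one row per review.
toy_table = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(fleiss_kappa(toy_table))
```

A value of 1 means perfect agreement, while values near 0 mean agreement no better than chance.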

Paper: Nhung Thi-Hong Nguyen, Phuong Ha-Dieu Phan, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen. Vietnamese Open-domain Complaint Detection in E-Commerce Websites.

Please contact us via email: 18521218@gm.uit.edu.vn (Ms. Nhung) to sign the corpus user agreement and then receive the corpus.

UIT-ViSFD: A Vietnamese Smartphone Feedback Dataset for Aspect-Based Sentiment Analysis

Abstract: In this paper, we present a process of building a social listening system based on aspect-based sentiment analysis in Vietnamese, from creating a dataset to building a real application. Firstly, we create UIT-ViSFD, a Vietnamese Smartphone Feedback Dataset, as a new benchmark corpus built on strict annotation schemes for evaluating aspect-based sentiment analysis. It consists of 11,122 human-annotated comments for mobile e-commerce and is freely available for research purposes. We also present an approach based on the Bi-LSTM architecture with fastText word embeddings for the Vietnamese aspect-based sentiment task. Our experiments show that our approach achieves the best performance, with an F1-score of 84.48% for the aspect task and 63.06% for the sentiment task, outperforming several conventional machine learning and deep learning systems. Last but not least, we build SA2SL, a social listening system based on the best-performing model on our dataset, which will inspire more social listening systems in the future.

Paper: Luong Luc Phan, Phuc Huynh Pham, Kim Thi-Thanh Nguyen, Tham Thi Nguyen, Sieu Khai Huynh, Luan Thanh Nguyen, Tin Van Huynh, Kiet Van Nguyen. SA2SL: From Aspect-Based Sentiment Analysis to Social Listening System for Business Intelligence.

Please contact us via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to sign the corpus user agreement and then receive the corpus.

UIT-ViCoQA: A Conversational Question Answering Challenge for Healthcare Texts in Vietnamese

Abstract: Machine reading comprehension (MRC) is a sub-field of natural language processing or computational linguistics. MRC aims to help computers understand unstructured texts and then answer questions related to them. In this paper, we present a new Vietnamese dataset for conversational machine reading comprehension, consisting of 10,000 questions with answers across 2,000 conversations about health news articles. We analyze UIT-ViCoQA in depth with respect to different linguistic aspects. We evaluate strong dialogue and reading comprehension models on UIT-ViCoQA. In addition, we conduct the first experiments on this dataset and achieve promising results. The best system obtains an F1 score of 51.28%, which is 24.90 points behind human performance (76.18%), indicating that there is ample room for improvement. The dataset is available on our research website for research purposes.

Paper: Son T. Luu, Mao Nguyen Bui, Loi Duc Nguyen, Khiem Vinh Tran, Kiet Van Nguyen (Corresponding Author), Ngan Luu-Thuy Nguyen. Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts.

To access this dataset, please complete and sign the dataset user agreement and then send it via email: kietnv@uit.edu.vn (Mr. Kiet Nguyen) to receive the dataset.
