kietnv
Natural Language Processing, Computational Linguistics, Parsing, Sentiment Analysis, Machine Reading Comprehension, and Question Answering
University of Information Technology, Ho Chi Minh City
Pinned Repositories
COVIDROP
Vi-COVIDQA is a numerical reasoning-based machine reading comprehension dataset in Vietnamese
Datasets-for-Sentiment-Analysis
Benchmark datasets for sentiment analysis
MRC-tool
NLP-Vietnamese-progress
Repository to track the progress in Vietnamese Natural Language Processing, including the datasets and the current state-of-the-art for the most common Vietnamese NLP tasks.
UIT-ViSD4SA
ViSD4SA, a Vietnamese Span Detection for Aspect-based Sentiment Analysis dataset
uit-vsfc
Vietnamese Students’ Feedback Corpus (UIT-VSFC) is a resource consisting of over 16,000 sentences, human-annotated for two tasks: sentiment-based and topic-based classification.
ViCCGbank
VietnameseDatasets
We provide benchmark datasets for evaluating Vietnamese processing models: UIT-ViQuAD, ViNewsQA, UIT-VSFC, UIT-ViIC, UIT-ViNames, UIT-VSMEC and ViMMRC.
ViHOS
Repository for the paper "ViHOS: Vietnamese Hate and Offensive Spans Detection" (EACL 2023)
vireader
Machine reading comprehension has attracted significant interest in natural language understanding research, and large-scale datasets and neural network-based methods have been developed for the task. However, most resources and methods for machine reading comprehension have been developed for two resource-rich languages, English and Chinese. This article proposes ViReader, a system for open-domain machine reading comprehension in Vietnamese that uses Wikipedia as its textual knowledge source: the answer to any question is a text span taken directly from Vietnamese Wikipedia. The system combines a sentence retriever, which uses information retrieval techniques to extract relevant sentences, with a transfer learning-based answer extractor trained to predict answers from Wikipedia texts. The sentence retriever selects the sentences most likely to contain the answer to the given question. The answer extractor then reads the document from which those sentences were retrieved, predicts the answer, and returns it to the user. Experiments on multiple machine reading comprehension datasets in Vietnamese and other languages demonstrate that (1) ViReader is highly competitive with prevalent machine learning-based systems, and (2) combining the sentence retriever and the answer extractor through multi-task learning yields an end-to-end reading comprehension system. ViReader achieves new state-of-the-art performance, with 70.83% EM (exact match) and 89.54% F1, outperforming the BERT-based system by 11.55% and 9.54%, respectively. It also obtains state-of-the-art performance on UIT-ViNewsQA (another Vietnamese dataset, consisting of online health-domain news) and BiPaR (a bilingual dataset of English and Chinese novel texts). Compared with the BERT-based system, our system improves F1 on BiPaR by 7.65% for English and 6.13% for Chinese. Furthermore, we build a ViReader application programming interface that programmers can use in artificial intelligence applications.
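The retrieval step described above can be illustrated with a minimal sketch. The description does not say which information retrieval technique ViReader uses, so the TF-IDF cosine ranking below is an assumption chosen for illustration, as are the function names (`tokenize`, `retrieve_sentences`) and the whitespace tokenizer, which ignores Vietnamese word segmentation. It is not the ViReader implementation.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    # Assumption: lowercase whitespace tokenization with edge punctuation stripped.
    # Real Vietnamese text would normally need a word segmenter.
    tokens = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def build_index(sentences):
    # Compute smoothed IDF weights and one TF-IDF vector per sentence.
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_sentences(question, sentences, top_k=2):
    # Rank candidate sentences by TF-IDF cosine similarity to the question.
    vectors, idf = build_index(sentences)
    counts = Counter(tokenize(question))
    # Question terms that never occur in the sentences carry no weight.
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), sent) for vec, sent in zip(vectors, sentences)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical example data, not taken from any ViReader dataset.
sentences = [
    "Hanoi is the capital of Vietnam.",
    "Ho Chi Minh City is the largest city in Vietnam.",
    "The Mekong River flows through several countries.",
]
top = retrieve_sentences("What is the capital of Vietnam?", sentences, top_k=1)
print(top[0][1])
```

In a retrieve-then-read system of the kind described, the top-ranked sentences (or the documents containing them) would then be passed to the answer extractor, which is omitted here because it depends on a trained model.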
kietnv's Repositories
kietnv/VietnameseDatasets
We provide benchmark datasets for evaluating Vietnamese processing models: UIT-ViQuAD, ViNewsQA, UIT-VSFC, UIT-ViIC, UIT-ViNames, UIT-VSMEC and ViMMRC.
kietnv/vireader
kietnv/Datasets-for-Sentiment-Analysis
Benchmark datasets for sentiment analysis
kietnv/NLP-Vietnamese-progress
Repository to track the progress in Vietnamese Natural Language Processing, including the datasets and the current state-of-the-art for the most common Vietnamese NLP tasks.
kietnv/COVIDROP
Vi-COVIDQA is a numerical reasoning-based machine reading comprehension dataset in Vietnamese
kietnv/MRC-tool
kietnv/UIT-ViSD4SA
ViSD4SA, a Vietnamese Span Detection for Aspect-based Sentiment Analysis dataset
kietnv/uit-vsfc
Vietnamese Students’ Feedback Corpus (UIT-VSFC) is a resource consisting of over 16,000 sentences, human-annotated for two tasks: sentiment-based and topic-based classification.
kietnv/ViCCGbank
kietnv/ViHOS
Repository for the paper "ViHOS: Vietnamese Hate and Offensive Spans Detection" (EACL 2023)