***** New August 23rd, 2020 *****
In this project, we introduce two BERT fine-tuning methods for sentiment analysis of Vietnamese comments: the method proposed by the BERT authors, which uses only the [CLS] token as input to an attached feed-forward neural network, and our proposed method, in which all of BERT's output vectors are used as inputs to other classification models. Experimental results on two datasets show that the BERT-based models outperform the other models. In particular, on both datasets our method always produces a model with better performance than the BERT-base method.
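The sketch below illustrates the two approaches, assuming PyTorch and the HuggingFace transformers library (version 4.x, where model outputs expose `last_hidden_state`). The multilingual checkpoint name, layer sizes, and the LSTM head are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch.nn as nn
from transformers import BertModel

class ClsHeadClassifier(nn.Module):
    """BERT-base approach: classify from the [CLS] output vector only."""
    def __init__(self, pretrained="bert-base-multilingual-cased", n_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.fc = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.fc(out.last_hidden_state[:, 0])   # vector of the [CLS] token

class AllTokensClassifier(nn.Module):
    """Our approach: feed all output vectors to a downstream model (an LSTM here)."""
    def __init__(self, pretrained="bert-base-multilingual-cased", n_classes=2, lstm_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * lstm_dim, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        seq, _ = self.lstm(out.last_hidden_state)     # every token vector, not just [CLS]
        return self.fc(seq[:, -1])                    # last LSTM state as the summary
```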
We developed and ran this project on Google Colab.
Code structure:
- BERT-base: We use the pretrained BERT model fine-tuned as proposed by the BERT authors
- BERT-embedding-CNN: We fine-tune BERT combined with TextCNN or RCNN
- BERT-embedding-LSTM: We fine-tune BERT combined with LSTM
- Data: Data for the experiments
- Test: We use other embedding models combined with neural networks
- Machine Learning Model: We fine-tune BERT combined with machine learning algorithms
Use Google Colab to fine-tune BERT combined with the neural networks. Afterwards, use the predict_text file to test on real data.
Description of the data after preprocessing (files in the Data folder):
Dataset | Train | Test | Total |
---|---|---|---|
ntc-sv | 20,493 Pos & 20,267 Neg | 5,000 Pos & 5,000 Neg | 50,760 |
vreview | 22,979 Pos & 19,537 Neg | 8,301 Pos & 6,795 Neg | 57,612 |
Statistics of the number of words per comment:
Statistic | vreview | ntc-sv |
---|---|---|
Mean | 55.45 | 86.57 |
Std | 63.75 | 77.41 |
Min | 1 | 1 |
25% | 14 | 37 |
50% | 32 | 65 |
75% | 76 | 111 |
Max | 435 | 1,501 |
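These statistics can be reproduced with pandas; a minimal sketch, assuming the preprocessed comments sit in a CSV file with a `comment` column (both the file name and the column name here are placeholders):

```python
import pandas as pd

# Load one of the preprocessed datasets (placeholder path).
df = pd.read_csv("Data/ntc-sv_train.csv")

# Count whitespace-separated words in each comment.
lengths = df["comment"].astype(str).str.split().str.len()

# describe() reports count, mean, std, min, 25%, 50%, 75%, max --
# the same statistics as in the table above.
print(lengths.describe(percentiles=[0.25, 0.5, 0.75]))
```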
In this section, we compare several models that can be used for Vietnamese sentiment analysis:
- SVM / Boosting: Two classic machine learning algorithms that were widely used before deep learning prevailed. For SVM we use n-gram features with n in the range [1, 5]; for boosting we use XGBoost with a maximum depth of 15 (see the sketch after this list).
- FastText + LSTM / TextCNN / RCNN: We combine the FastText word-embedding model with the same three algorithms that we combine with BERT, so that the results our models achieve can be evaluated more accurately. The pretrained FastText model used was trained on a Vietnamese dataset.
- GloVe + LSTM / TextCNN / RCNN: We use the glove-python library to train the word embeddings ourselves, because no pretrained GloVe model with the same dimensions as the pretrained FastText model is available; training our own ensures fairness for all models. The models combined with GloVe are the same as those combined with FastText.
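A minimal scikit-learn / XGBoost sketch of the two machine-learning baselines; the TF-IDF weighting and the toy Vietnamese examples are assumptions for illustration, while the n-gram range and tree depth follow the description above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# Toy training data; the real experiments use the datasets in the Data folder.
train_texts = ["món ăn rất ngon", "phục vụ quá tệ"]   # "very tasty" / "terrible service"
train_labels = [1, 0]                                  # 1 = Pos, 0 = Neg

# n-gram features with n in [1, 5]; TF-IDF weighting is an assumption here.
svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 5)), LinearSVC())
svm.fit(train_texts, train_labels)

# XGBoost over the same n-gram features, with a maximum tree depth of 15.
xgb = make_pipeline(TfidfVectorizer(ngram_range=(1, 5)), XGBClassifier(max_depth=15))
xgb.fit(train_texts, train_labels)

print(svm.predict(["đồ ăn ngon"]))   # hypothetical new comment
```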
The results of comparing the models on the two datasets are shown in the tables below. The main evaluation measure is the F1-score.
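For reference, the F1-score is the harmonic mean of precision and recall: F1 = 2 · Precision · Recall / (Precision + Recall).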
Results of our models on the NTC-SV dataset compared to other models:
Model | Precision(%) | Recall(%) | F1(%) |
---|---|---|---|
SVM | 89.23 | 92.52 | 90.84 |
XGBoost | 88.76 | 90.58 | 89.63 |
FastText + TextCNN | 67.9 | 89.1 | 77.1 |
FastText + LSTM | 88.5 | 89.7 | 89.1 |
FastText + RCNN | 89.2 | 91.7 | 90.4 |
GloVe + TextCNN | 69.7 | 87.7 | 77.7 |
GloVe + LSTM | 88.7 | 91.8 | 89.8 |
GloVe + RCNN | 85.8 | 85.8 | 90.7 |
BERT-base | 88.13 | 94.02 | 90.9 |
BERT-LSTM | 89.78 | 92.08 | 90.91 |
BERT-TextCNN | 88.85 | 93.14 | 90.94 |
BERT-RCNN | 88.76 | 93.68 | 91.15 |
Results of our models on the vreview dataset compared to other models:
Model | Precision(%) | Recall(%) | F1(%) |
---|---|---|---|
SVM | 86.26 | 86.9 | 86.5 |
XGBoost | 87.69 | 88.45 | 88.07 |
FastText + TextCNN | 61.8 | 94.0 | 74.6 |
FastText + LSTM | 88.5 | 86.4 | 87.5 |
FastText + RCNN | 84.5 | 89.8 | 87.1 |
GloVe + TextCNN | 62.6 | 93.0 | 74.8 |
GloVe + LSTM | 85.8 | 85.8 | 85.8 |
GloVe + RCNN | 84.0 | 88.6 | 86.2 |
BERT-base | 86.08 | 88.44 | 87.2 |
BERT-LSTM | 85.25 | 89.9 | 87.5 |
BERT-TextCNN | 90.9 | 85.2 | 87.98 |
BERT-RCNN | 87.08 | 89.38 | 88.22 |
Nguyen Quoc Thai
Nguyen Thoai Linh
Quoc Hung Ngo
Hoang Ngoc Luong
University of Information Technology
Vietnam National University - HCM City
Ho Chi Minh City, Viet Nam
The paper is linked in this GitHub repository.