Attribute Classification of COVID-19-Related Tweets Based on Natural Language Processing Models (Student Research Training Program)
Our work is based on the NLP4IF Workshop Shared Task on Fighting the COVID-19 Infodemic.
The main task is to predict a series of binary attributes of COVID-19-related tweets along seven aspects. The first, sixth, and seventh questions ask whether the tweet contains a verifiable factual claim, whether it is harmful to society, and whether it requires attention. The second through fifth questions depend on the first: if the tweet is a factual statement, we further judge whether it is false information, whether it is of interest to the general public, how harmful it is, and whether it needs verification.
This is a multi-task problem, and there are dependencies between the tasks.
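The label schema and its dependency can be summarized in a minimal sketch (the question names below are paraphrases of the task description, not official identifiers):

```python
# Seven binary questions; answers to Q2-Q5 are only meaningful when Q1
# ("verifiable factual claim") is answered "yes".
QUESTIONS = [
    "verifiable_factual_claim",   # Q1
    "false_information",          # Q2, depends on Q1
    "interest_to_general_public", # Q3, depends on Q1
    "harmfulness",                # Q4, depends on Q1
    "need_of_verification",       # Q5, depends on Q1
    "harmful_to_society",         # Q6
    "requires_attention",         # Q7
]
DEPENDENT_ON_Q1 = {1, 2, 3, 4}    # 0-based indices of Q2-Q5

def resolve_labels(raw):
    """Mask the dependent answers when Q1 is 'no' (None = not applicable)."""
    return [None if i in DEPENDENT_ON_Q1 and raw[0] == 0 else raw[i]
            for i in range(len(raw))]
```

This masking is the reason the problem is multi-task rather than seven independent classifiers: the model's answers to questions 2-5 are only scored when question 1 is positive.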
The dataset includes tweets in English, Bulgarian, and Arabic.
Because the data consists of real tweets, it inevitably contains emoji and URLs, which poses some challenges for data preprocessing.
Inspired by the design and ideas of "Multi Output Learning using Task Wise Attention for Predicting Binary Properties of Tweets: Shared-Task-On-Fighting the COVID-19 Infodemic", which ranked second in the competition at the time, we established our baseline and made further improvements to the training pipeline. In the data preprocessing stage, we keep the labels of the training sets in different languages and translate the tweets between languages (mutual translation) to enlarge the training data.
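The label-preserving translation augmentation can be sketched as follows; `translate` is a placeholder for whatever machine-translation model or service the pipeline uses, and the field names are illustrative:

```python
def translate(text, src, tgt):
    # Placeholder: a real pipeline would call a machine-translation
    # model here (e.g. an MT service or a translation transformer).
    return text

def augment_with_translations(dataset, languages=("en", "bg", "ar")):
    """Translate each example into the other languages, keeping its labels."""
    augmented = []
    for example in dataset:
        augmented.append(example)
        for tgt in languages:
            if tgt == example["lang"]:
                continue
            augmented.append({
                "text": translate(example["text"], example["lang"], tgt),
                "lang": tgt,
                "labels": example["labels"],  # labels carry over unchanged
            })
    return augmented
```

Each tweet thus contributes one example per language, which roughly triples the training data while leaving the label distribution intact.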
BERT, RoBERTa, and XLM-RoBERTa models are used as the pre-trained encoders.
BiLSTM+Attention, TextCNN, and multi-head attention models are used as the classifier heads.
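To illustrate how a classifier head sits on top of the encoder, here is a minimal PyTorch sketch of the BiLSTM+Attention variant. The dimensions are illustrative assumptions; the encoder (e.g. XLM-RoBERTa) is assumed to output one hidden state per token:

```python
import torch
import torch.nn as nn

class BiLSTMAttnHead(nn.Module):
    """BiLSTM + additive attention head over encoder token states.

    Expects `hidden` of shape (batch, seq_len, enc_dim), e.g. the
    last_hidden_state of a pre-trained transformer encoder.
    """
    def __init__(self, enc_dim=768, lstm_dim=256, num_labels=7):
        super().__init__()
        self.lstm = nn.LSTM(enc_dim, lstm_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * lstm_dim, 1)        # score per token
        self.out = nn.Linear(2 * lstm_dim, num_labels)

    def forward(self, hidden):
        seq, _ = self.lstm(hidden)                    # (B, T, 2*lstm_dim)
        weights = torch.softmax(self.attn(seq), dim=1)  # (B, T, 1)
        pooled = (weights * seq).sum(dim=1)           # attention pooling
        return self.out(pooled)                       # one logit per question
```

The TextCNN and multi-head attention heads slot into the same position: they consume the encoder's token states and emit seven logits, one per question.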
The loss function is improved with biased (per-task) weights, replacing the uniform weights used in the original paper.
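A biased-weight loss of this kind can be sketched as a per-question weighted binary cross-entropy; the weight values below are illustrative placeholders, not the ones used in our experiments:

```python
import torch
import torch.nn as nn

# Hypothetical per-question weights: harder or more imbalanced questions
# get a larger weight. Setting all weights to 1.0 recovers the uniform
# baseline of the original paper.
task_weights = torch.tensor([1.0, 2.0, 1.5, 1.0, 1.0, 2.0, 1.0])

bce = nn.BCEWithLogitsLoss(reduction="none")

def biased_loss(logits, targets):
    """Weighted BCE over the seven binary questions."""
    per_task = bce(logits, targets)       # (batch, 7), one term per question
    return (per_task * task_weights).mean()
```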
Finally, we propose a voting mechanism with two schemes: All vote and Top-6 vote.
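The two schemes can be expressed with one majority-vote function, where Top-6 simply restricts the vote to the six models with the highest validation mean F1 (array shapes and the `scores` argument are our own framing of the idea):

```python
import numpy as np

def vote(predictions, scores=None, top_k=None):
    """Majority vote over per-model binary predictions.

    predictions: (n_models, n_samples, n_tasks) array of 0/1 labels.
    All vote uses every model; Top-6 vote passes top_k=6 together with
    each model's validation score so only the best 6 models vote.
    """
    predictions = np.asarray(predictions)
    if top_k is not None:
        keep = np.argsort(scores)[::-1][:top_k]   # best-scoring models first
        predictions = predictions[keep]
    return (predictions.mean(axis=0) >= 0.5).astype(int)
```

With ties broken toward the positive class here; a production version would need to fix the tie-breaking rule explicitly for an even number of voters.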
After experimentation and optimization, taking mean F1 score as the metric, we trained 12 models, several of which surpassed the best average F1 score of 89.7% reported in "Fighting the COVID-19 Infodemic with a Holistic BERT Ensemble", including RoBERTa-lstm-attn (91.38%), xlmRoberta-lstm-attn (91.09%), xlmRoberta-lstm-attn-biasedWeight (90.67%), and xlmRoberta-multihead (90.49%). Applying the voting mechanism further improved the results, reaching an ultimate best vote score of 93.54%.