covid_fake_news

Primary language: Jupyter Notebook. License: MIT.

COVID19 Fake News Detection in English 🔎 👀

This repository contains the code for the paper "A Heuristic-driven Ensemble Framework for COVID-19 Fake News Detection" (accepted at the CONSTRAINT Workshop, AAAI 2021).

Preprint: https://arxiv.org/abs/2101.03545

Task Description

This is a subtask of the CONSTRAINT-2021 shared task on hostile post detection. It focuses on detecting COVID19-related fake news in English. The data comes from various social media platforms such as Twitter, Facebook, and Instagram. Given a social media post, the objective is to classify it as either fake or real news.

For example, the image below shows two posts that belong to the fake and real categories, respectively.

[Image: example fake and real posts]

English Dataset: https://competitions.codalab.org/competitions/26655 or https://github.com/diptamath/covid_fake_news/tree/main/data

English dataset paper: https://arxiv.org/abs/2011.03327

Link to Competition: https://constraint-shared-task-2021.github.io/

Our Approach

Our basic approach involves trying out different pre-trained language models. Such models have achieved state-of-the-art results on a variety of text classification tasks, which was our motivation for using them. We tried XLNet, RoBERTa, XLM-RoBERTa, DeBERTa, ELECTRA and ERNIE 2.0. The individual training model files can be obtained here.

To improve the performance of our classification model, we tried various ensemble techniques over different combinations of these models. The combination that yielded the best result uses XLNet, RoBERTa, XLM-RoBERTa and DeBERTa. We created a new feature set from the predictions of the individual models and saved the resulting feature data. We tried two ensemble techniques, Hard Voting and Soft Voting; Soft Voting achieved superior results with the above model combination. The code files related to ensembling can be found at this link.
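The difference between the two voting schemes can be illustrated with a minimal sketch. This is not the repository's ensembling code: the function names `soft_vote` and `hard_vote`, the label order in `LABELS`, and the probability values are all assumptions made for illustration. Soft voting averages the class probabilities of all models before picking a label, while hard voting takes a majority over each model's own predicted label.

```python
from collections import Counter
from typing import List, Sequence

# Assumed label order for this sketch: index 0 = fake, index 1 = real.
LABELS = ("fake", "real")

# Shape: [model][sample][class probability]
ModelProbs = Sequence[Sequence[Sequence[float]]]


def soft_vote(model_probs: ModelProbs) -> List[str]:
    """Average class probabilities across models, then take the argmax."""
    n_models = len(model_probs)
    predictions = []
    for per_sample in zip(*model_probs):  # one probability vector per model
        avg = [sum(p[c] for p in per_sample) / n_models
               for c in range(len(LABELS))]
        predictions.append(LABELS[avg.index(max(avg))])
    return predictions


def hard_vote(model_probs: ModelProbs) -> List[str]:
    """Let each model cast one vote (its own argmax), then take the majority."""
    predictions = []
    for per_sample in zip(*model_probs):
        votes = [LABELS[list(p).index(max(p))] for p in per_sample]
        predictions.append(Counter(votes).most_common(1)[0][0])
    return predictions


# Made-up probabilities for three hypothetical models on two posts.
probs = [
    [[0.90, 0.10], [0.45, 0.55]],  # model A
    [[0.80, 0.20], [0.45, 0.55]],  # model B
    [[0.60, 0.40], [0.95, 0.05]],  # model C
]

soft = soft_vote(probs)
hard = hard_vote(probs)
print(soft)  # ['fake', 'fake']
print(hard)  # ['fake', 'real']
```

On the second post the two schemes disagree: two models lean weakly towards real, but the third is very confident the post is fake, so the averaged probability favours fake. Soft voting can use this confidence information, which hard voting discards.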

All our work related to Heuristic Post-Processing can be obtained from the Analysis Folder. First, we extract our username statistics and domain statistics from the training data and save them in the Statistical meta folder. We merge our statistical features using this code. Finally, we create our datasets for post-processing and apply our post-processing algorithm to obtain the final classification result.
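The general idea of the statistics-based post-processing can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the Analysis folder: the helper names (`extract_attributes`, `build_stats`, `post_process`), the regular expressions, the `threshold` and `min_count` parameters, the rule of checking handles before domains, and the example posts are all hypothetical. The sketch counts how often each username handle and URL domain appears in fake and real training posts, and overrides the model's label only when a handle or domain in the post is strongly associated with one class.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

HANDLE_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_attributes(text):
    """Return the lowercased username handles and URL domains found in a post."""
    handles = [h.lower() for h in HANDLE_RE.findall(text)]
    domains = [urlparse(u).netloc.lower() for u in URL_RE.findall(text)]
    return handles, domains


def build_stats(posts):
    """Count, per handle and per domain, the fake and real training posts."""
    handle_stats = defaultdict(lambda: {"fake": 0, "real": 0})
    domain_stats = defaultdict(lambda: {"fake": 0, "real": 0})
    for text, label in posts:
        handles, domains = extract_attributes(text)
        for h in set(handles):  # count each handle once per post
            handle_stats[h][label] += 1
        for d in set(domains):
            domain_stats[d][label] += 1
    return handle_stats, domain_stats


def post_process(text, model_label, handle_stats, domain_stats,
                 threshold=0.9, min_count=2):
    """Override model_label when a known handle/domain is strongly one-sided.

    Assumption of this sketch: handles take priority over domains.
    """
    handles, domains = extract_attributes(text)
    candidates = [(h, handle_stats) for h in handles]
    candidates += [(d, domain_stats) for d in domains]
    for key, stats in candidates:
        counts = stats.get(key)
        if not counts:
            continue
        total = counts["fake"] + counts["real"]
        if total < min_count:  # too few observations to trust
            continue
        for label in ("fake", "real"):
            if counts[label] / total >= threshold:
                return label
    return model_label  # no strong signal: keep the model's prediction


# Made-up training posts, for illustration only.
train = [
    ("Stay safe, says @CDCgov https://www.cdc.gov/a", "real"),
    ("@CDCgov update https://www.cdc.gov/b", "real"),
    ("Miracle cure! https://fakecure.example/x", "fake"),
    ("Garlic cures covid https://fakecure.example/y", "fake"),
]
handle_stats, domain_stats = build_stats(train)

a = post_process("New guidance from @CDCgov", "fake", handle_stats, domain_stats)
b = post_process("Buy now https://fakecure.example/z", "real", handle_stats, domain_stats)
c = post_process("No handles or links here", "real", handle_stats, domain_stats)
print(a, b, c)
```

Keeping the model's label whenever no handle or domain passes the threshold means the heuristic only affects posts that carry a strong signal from the training data; the threshold controls how one-sided that signal must be.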

We also perform an ablation study regarding the priority of username handles and URL domains, and also regarding the threshold parameter, which can be accessed here.

Results

  • Our initial approach using ensembling achieved an F1-score of 98.31, against the 98.69 F1-score of the top leaderboard entry.
  • After the evaluation phase, we improved our solution to an F1-score of 98.83 using Heuristic Post-Processing.

Citation

Please consider citing our paper in your publications if the project helps your research. The BibTeX reference is as follows:

@article{das2021heuristic,
title={A Heuristic-driven Ensemble Framework for COVID-19 Fake News Detection},
author={Das, Sourya Dipta and Basak, Ayan and Dutta, Saikat},
journal={arXiv preprint arXiv:2101.03545},
year={2021}
}