CIKM-2024

From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs

The spread of fake news is a pressing global issue, especially in low-resource languages like Bangla, which lack sufficient datasets and tools for effective detection. Manual fact-checking, though accurate, is time-consuming and allows misleading information to propagate widely. Building on previous efforts, we introduce BanFakeNews-2.0, an enhanced dataset that significantly advances fake news detection capabilities in Bangla. This new version includes 11,700 additional meticulously curated and manually annotated fake news articles, resulting in a more balanced and comprehensive collection of 47,000 authentic news and 13,000 fake news items across 13 categories. In addition, we develop an independent test dataset with 460 fake news and 540 authentic news items for rigorous evaluation. To understand the data characteristics, we perform an exploratory analysis of BanFakeNews-2.0 and establish a benchmark system using cutting-edge Natural Language Processing (NLP) techniques. Our benchmark employs transformer-based models, including Bidirectional Encoder Representations from Transformers (BERT) and its Bangla and multilingual variants. Furthermore, we fine-tune large language models (LLMs) with Quantized Low-Rank Adaptation (QLoRA), leveraging gradient accumulation and a paged Adam 8-bit optimizer for classification tasks. Our results show that LLMs and transformer-based approaches significantly outperform traditional linguistic feature-based and neural network-based methods in detecting fake news. BanFakeNews-2.0's expanded and balanced dataset offers substantial potential to drive further research and development in fake news detection for low-resource languages. By providing a robust and comprehensive resource, we aim to empower researchers and practitioners to develop more accurate and efficient tools to combat misinformation in Bangla and similar languages.

The BanFakeNews-2.0 dataset is available on Kaggle at the link below. Authentic news is labeled 1 and fake news is labeled 0.

https://www.kaggle.com/datasets/hrithikmajumdar/bangla-fake-news
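As a minimal sketch of working with this label convention, the snippet below counts authentic and fake items in a CSV file. The column name `label` and the sample rows are assumptions made for illustration; check the actual headers in the Kaggle files before using it.

```python
import csv
import io
from collections import Counter

# Label convention stated above: 1 = authentic, 0 = fake.
LABEL_NAMES = {"1": "authentic", "0": "fake"}


def count_labels(csv_file, label_column="label"):
    """Count rows per class in an open CSV file.

    `label_column` is a hypothetical column name; adjust it to the real header.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[LABEL_NAMES[row[label_column].strip()]] += 1
    return counts


# Tiny in-memory stand-in for a dataset file (made-up rows, not real data).
sample = io.StringIO("headline,label\nA,1\nB,0\nC,1\n")
print(count_labels(sample))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.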

The dataset is also published on Mendeley Data, a dataset-sharing platform. Its DOI link is given below.

https://data.mendeley.com/datasets/kjh887ct4j/1

Traditional Linguistic Features with SVM:

The FakeNews-master folder contains our experiments with a classical machine learning model (SVM) trained on linguistic features: Unigram, Bigram, Trigram, and C3-, C4-, and C5-gram.
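To make the feature names concrete, here is a small sketch of n-gram extraction. It assumes that Unigram, Bigram, and Trigram are word-level n-grams and that C3 to C5 denote character 3-, 4-, and 5-grams, as the names suggest; the tokenization in the repository scripts may differ (this version splits on whitespace and slices by Unicode code point).

```python
def word_ngrams(text, n):
    """Return the word-level n-grams of `text`, split on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character-level n-grams of `text` (by Unicode code point)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


headline = "fake news spreads fast"
print(word_ngrams(headline, 2))  # ['fake news', 'news spreads', 'spreads fast']
print(char_ngrams("news", 3))    # ['new', 'ews']
```

Combined features such as U+B+T would then correspond to concatenating the outputs for n = 1, 2, and 3.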

Basic Experiments

  • Go to FakeNews-master/Models/Basic folder
  • Run `python n-gram.py [Experiment Name] [Model] [-s]` to run an experiment. The `-s` flag is optional and saves the results. For example, `python n-gram.py Emb_F SVM -s` runs the Emb_F experiment with the SVM model and saves the results.

    Experiment Names

    (See the paper for details of each experiment):
    • Unigram
    • Bigram
    • Trigram
    • U+B+T
    • C3-gram
    • C4-gram
    • C5-gram
    • C3+C4+C5
    • Embedding
    • all_features

    Models

    • SVM (Support Vector Machine)
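To run several experiments in a row, the commands can be generated in a loop. The sketch below only prints the command lines. It assumes the script accepts the experiment names exactly as listed above, which is not guaranteed: the usage example passes `Emb_F`, so some identifiers may differ from the list.

```python
import shlex

# Experiment names copied from the list above; SVM is the only listed model.
EXPERIMENTS = [
    "Unigram", "Bigram", "Trigram", "U+B+T",
    "C3-gram", "C4-gram", "C5-gram", "C3+C4+C5",
    "Embedding", "all_features",
]


def build_command(experiment, model="SVM", save=True):
    """Build the argument list for one run of n-gram.py."""
    cmd = ["python", "n-gram.py", experiment, model]
    if save:
        cmd.append("-s")  # optional flag that saves the results
    return cmd


for name in EXPERIMENTS:
    # Printed rather than executed, so nothing runs by accident.
    print(shlex.join(build_command(name)))
```

To execute a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)` from inside the FakeNews-master/Models/Basic folder.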

BERT model training notebooks for Table 3

These notebooks have the following naming convention: "training with FakeNews .ipynb"

SVM training notebooks with Embedding features and All Features for Table 3

Embedding feature notebook name: "Fasttext_svm.ipynb"

All features notebook name: "all-features-svm-c-1-degree-3.ipynb"

BLOOM and Phi-3 mini training notebooks for Table 3

BLOOM notebook name: "bloom-banfakenews1 (1).ipynb"

Phi-3 mini notebook name: "phi3-mini-banfakenews2.ipynb"
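The notebooks hold the actual training setup. For orientation only, the sketch below shows how QLoRA fine-tuning with gradient accumulation and a paged 8-bit optimizer is commonly configured with the Hugging Face `transformers` and `peft` libraries. Every value here (rank, batch size, learning rate, output directory) is an illustrative assumption rather than the setting used in our experiments, and `paged_adamw_8bit` is the closest built-in optimizer option, which may not match the notebooks exactly.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def build_qlora_configs():
    """Build quantization, adapter, and trainer settings (illustrative values).

    Imports are local so this file can be read without the GPU stack installed.
    """
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig, TrainingArguments

    # 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # Low-rank adapters for sequence classification; r and alpha are assumed.
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, task_type="SEQ_CLS"
    )
    # Gradient accumulation plus a paged 8-bit optimizer to limit GPU memory.
    training_args = TrainingArguments(
        output_dir="qlora-out",  # hypothetical path
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        optim="paged_adamw_8bit",
        learning_rate=2e-4,
        num_train_epochs=1,
    )
    return quant_config, lora_config, training_args


# With the assumed values above, each optimizer step sees 4 * 8 examples.
print(effective_batch_size(4, 8))  # 32
```

In a typical workflow, `quant_config` is passed as `quantization_config` to `AutoModelForSequenceClassification.from_pretrained`, and the loaded model is wrapped with `peft.get_peft_model(model, lora_config)` before training.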

Table 4 notebook name descriptions

"Banfakenews1" represents BanFakeNews Dataset

"Banfakenews2" represents BanFakeNews-2.0 Dataset

"Newnews" indicates that the model was tested on the external dataset
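As a quick aid for reading notebook names, the sketch below looks up these tags in a file name. It assumes the tags appear as case-insensitive substrings; how the tags are combined within a single Table 4 notebook name is not specified here, so the function simply reports every tag it finds.

```python
# Tags and meanings taken from the descriptions above.
DATASET_TAGS = {
    "banfakenews1": "BanFakeNews Dataset",
    "banfakenews2": "BanFakeNews-2.0 Dataset",
    "newnews": "external test dataset",
}


def datasets_in_name(notebook_name):
    """Return descriptions of the dataset tags found in a notebook file name."""
    lowered = notebook_name.lower()
    return [meaning for tag, meaning in DATASET_TAGS.items() if tag in lowered]


print(datasets_in_name("bloom-banfakenews1 (1).ipynb"))  # ['BanFakeNews Dataset']
```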