
FaKe news Text Collections (FKTC)

A library for loading and using fake news text collections

If you use any part of this code in your research, please cite it using the following BibTeX entry:

@inproceedings{ref:Golo2021,
  title={Learning textual representations from multiple modalities to detect fake news through one-class learning},
  author={Gôlo, Marcos and Caravanti, Mariana and Rossi, Rafael and Rezende, Solange and Nogueira, Bruno and Marcacini, Ricardo},
  booktitle={Proceedings of the Brazilian Symposium on Multimedia and the Web},
  pages={197--204},
  year={2021}
}

How to use

!pip install git+https://github.com/GoloMarcos/FKTC/   # notebook syntax; drop the "!" in a shell

from FakeNewsTextCollections import datasets
datasets_dictionary = datasets.load()   # dict mapping dataset keys to pandas DataFrames
df = datasets_dictionary['fcn']         # Fact Checked News (see Datasets below)
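
A quick way to check what was loaded is to iterate over the dictionary. A minimal sketch, assuming only the keys and DataFrame columns documented in this README:

for name, frame in datasets_dictionary.items():
    print(name, frame.shape)       # one row per news item

print(df.columns.tolist())         # text, class, folds, features, ...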

Datasets

  • Fact Checked News (fcn): Ribeiro, V. H. P. Identificação de notícias falsas em língua portuguesa [Identification of fake news in Portuguese]. Undergraduate thesis (TCC), Universidade Federal de Mato Grosso do Sul, 2019.

  • Fake News Net (fnn): Shu, Kai, et al. FakeNewsNet: a data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data, v. 8, n. 3, p. 171–188, 2020.

  • Fake BR (fakebr): Monteiro, R.; Santos, R.; Pardo, T.; Almeida, T. de; Ruiz, E.; Vale, O. Contributions to the study of fake news in Portuguese: new corpus and automatic detection results. In: PROPOR 2018: International Conference on Computational Processing of the Portuguese Language. Springer, 2018. p. 324–334.

  • Fake News Corpus 0 (fnc0): collection derived from https://github.com/several27/FakeNewsCorpus

  • Fake News Corpus 1 (fnc1): collection derived from https://github.com/several27/FakeNewsCorpus

  • Fake News Corpus 2 (fnc2): collection derived from https://github.com/several27/FakeNewsCorpus

Datasets Characteristics

             fcn     fakebr  fnn     fnc0    fnc1    fnc2
Language     pt      pt      en      en      en      en
Fake News    1,044   3,598   1,705   3,000   3,000   3,000
Real News    1,020   3,598   5,298   3,000   3,000   3,000
Total News   2,064   7,196   7,003   6,000   6,000   6,000
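
These counts can be verified directly from a loaded DataFrame. A short sketch, assuming the class encoding documented in the next section (fake = 1, real = -1):

df = datasets_dictionary['fcn']
print('Fake News:', (df['class'] == 1).sum())    # 1,044 for fcn
print('Real News:', (df['class'] == -1).sum())   # 1,020 for fcn
print('Total News:', len(df))                    # 2,064 for fcn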

Columns of the DataFrame

  • index: id
  • text: content of the news
  • class: fake (1) | real (-1)
  • folds: fold identifier in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} (ten folds per dataset)
  • features: 63 features extracted using Linguistic Inquiry and Word Count (LIWC)
  • features_normalized: 63 features with normalization extracted using LIWC
  • BERT: embedding with 1024 real values
  • DistilBERT: embedding with 768 real values
  • Multilingual DistilBERT: embedding with 512 real values
  • RoBERTa: embedding with 1024 real values
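
These columns are enough to build a fold-based split and feed a precomputed embedding to a classifier. A hedged sketch (column names as documented above; that each embedding cell holds a list of floats is an assumption):

import numpy as np

df = datasets_dictionary['fcn']
test_fold = 0                          # hold out one of the ten folds
train = df[df['folds'] != test_fold]
test = df[df['folds'] == test_fold]

# Stack the per-row embeddings into matrices (1024 columns for BERT).
X_train = np.array(train['BERT'].tolist())
X_test = np.array(test['BERT'].tolist())
y_train = train['class'].to_numpy()    # fake (1) | real (-1)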

Linguistic Inquiry and Word Count (LIWC)

  • Pennebaker, James W., et al. The development and psychometric properties of LIWC2015. 2015.

We obtained the embeddings with the sentence-transformers library (v1.0.4, https://www.sbert.net/), using the models below; a short usage sketch follows the list.

  • BERT model: bert-large-nli-stsb-mean-tokens
    • Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
  • DistilBERT model: distilbert-base-nli-stsb-mean-tokens
    • Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
  • RoBERTa model: roberta-large-nli-stsb-mean-tokens
    • Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).
  • DistilBERT Multilingual model: distiluse-base-multilingual-cased
    • Reimers, Nils, and Iryna Gurevych. "Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
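
For reference, this is roughly how sentence-transformers computes such embeddings. A minimal sketch with one of the models listed above (illustrative only; the exact preprocessing behind the published collections is not documented here):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
embeddings = model.encode(['Example news text to embed.'])
print(embeddings.shape)   # (1, 768): one 768-dimensional DistilBERT embedding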