FaKe news Text Collections (FKTC)
A library for loading fake news text collections.
If you use any part of this code in your research, please cite it using the following BibTeX entry:
@inproceedings{ref:Golo2021,
title={Learning textual representations from multiple modalities to detect fake news through one-class learning},
author={Gôlo, Marcos and Caravanti, Mariana and Rossi, Rafael and Rezende, Solange and Nogueira, Bruno and Marcacini, Ricardo},
booktitle={Proceedings of the Brazilian Symposium on Multimedia and the Web},
pages={197--204},
year={2021}
}
How to use
!pip install git+https://github.com/GoloMarcos/FKTC/
from FakeNewsTextCollections import datasets
datasets_dictionary = datasets.load()
df = datasets_dictionary['fcn']
Datasets
- Fact Checked News (fcn): RIBEIRO, V. H. P. Identificação de notícias falsas em língua portuguesa. Monografia (TCC). Universidade Federal de Mato Grosso do Sul, 2019.
- Fake News Net (fnn): Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data, v. 8, n. 3, p. 171–188, 2020.
- Fake BR (fakebr): MONTEIRO, R.; SANTOS, R.; PARDO, T.; ALMEIDA, T. de; RUIZ, E.; VALE, O. Contributions to the study of fake news in portuguese: New corpus and automatic detection results. In: PROPOR 2018: International Conference on Computational Processing of the Portuguese Language. [S.l.]: Springer, 2018. p. 324–334.
- Fake News Corpus 0 (fnc0): collection derived from https://github.com/several27/FakeNewsCorpus
- Fake News Corpus 1 (fnc1): collection derived from https://github.com/several27/FakeNewsCorpus
- Fake News Corpus 2 (fnc2): collection derived from https://github.com/several27/FakeNewsCorpus
Dataset Characteristics
- | fcn | fakebr | fnn | fnc0 | fnc1 | fnc2 |
---|---|---|---|---|---|---|
Language | pt | pt | en | en | en | en |
Fake News | 1,044 | 3,598 | 1,705 | 3,000 | 3,000 | 3,000 |
Real News | 1,020 | 3,598 | 5,298 | 3,000 | 3,000 | 3,000 |
Total News | 2,064 | 7,196 | 7,003 | 6,000 | 6,000 | 6,000 |
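The totals in the table can be re-derived, and the class balance made explicit, with a few lines of Python. This is only a sketch: the counts are copied by hand from the table, and `COUNTS` and `summarize` are hypothetical names that are not part of the library.

```python
# (fake, real) counts per collection, copied from the table above.
COUNTS = {
    "fcn": (1044, 1020),
    "fakebr": (3598, 3598),
    "fnn": (1705, 5298),
    "fnc0": (3000, 3000),
    "fnc1": (3000, 3000),
    "fnc2": (3000, 3000),
}


def summarize(counts):
    """Return {name: (total, fake_share)}, with fake_share rounded to 3 places."""
    summary = {}
    for name, (fake, real) in counts.items():
        total = fake + real
        summary[name] = (total, round(fake / total, 3))
    return summary


if __name__ == "__main__":
    for name, (total, fake_share) in summarize(COUNTS).items():
        print(f"{name}: total={total}, fake share={fake_share:.3f}")
```

Most collections are balanced or nearly so. The exception is fnn, where fake news makes up roughly a quarter of the items, which may matter when choosing an evaluation metric.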
Columns of the DataFrame
- index: id
- text: content of the news
- class: fake (1) | real (-1)
- folds: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
- features: 63 features extracted using Linguistic Inquiry and Word Count (LIWC)
- features_normalized: the 63 LIWC features after normalization
- BERT: embedding with 1024 real values
- DistilBERT: embedding with 768 real values
- Multilingual DistilBERT: embedding with 512 real values
- RoBERTa: embedding with 1024 real values
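One way the `class` and `folds` columns might be used together is sketched below. It rests on two assumptions that this README does not state: that `folds` holds a cross-validation fold id for each row, and that training follows a one-class setup where only fake news (class 1) is seen during training. The function `one_class_split` and the toy rows are hypothetical. The sketch works on plain dictionaries so it runs without any dependency; for a real collection, `df.to_dict("records")` yields rows of this shape.

```python
def one_class_split(rows, test_fold, interest_class=1):
    """Split rows into a one-class training set and a two-class test set.

    rows: iterable of dicts with at least the keys 'class' and 'folds'.
    test_fold: fold id (0-9) held out for testing.
    interest_class: label used for training; 1 means fake in these collections.
    """
    train, test = [], []
    for row in rows:
        if row["folds"] == test_fold:
            test.append(row)       # both classes are kept for evaluation
        elif row["class"] == interest_class:
            train.append(row)      # only the class of interest is used for training
    return train, test


# Toy rows standing in for a real collection (values are made up).
toy_rows = [
    {"text": "a", "class": 1, "folds": 0},
    {"text": "b", "class": -1, "folds": 0},
    {"text": "c", "class": 1, "folds": 1},
    {"text": "d", "class": -1, "folds": 1},
    {"text": "e", "class": 1, "folds": 2},
]
train_rows, test_rows = one_class_split(toy_rows, test_fold=0)
```

Repeating the call for each fold id from 0 to 9 would give ten train/test pairs. Real news from the training folds is simply left out under this assumption.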
Linguistic Inquiry and Word Count (LIWC)
- Pennebaker, James W., et al. The development and psychometric properties of LIWC2015. 2015.
We obtain the embeddings with the library sentence_transformers (v==1.0.4) (https://www.sbert.net/)
- BERT model: bert-large-nli-stsb-mean-tokens
- Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
- DistilBERT model: distilbert-base-nli-stsb-mean-tokens
- Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
- RoBERTa model: roberta-large-nli-stsb-mean-tokens
- Liu, Zhuang, et al. "A Robustly Optimized BERT Pre-training Approach with Post-training." China National Conference on Chinese Computational Linguistics. Springer, Cham, 2021.
- DistilBERT Multilingual model: distiluse-base-multilingual-cased
- Reimers, Nils, and Iryna Gurevych. "Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
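For embedding new texts in the same way, the following sketch maps each embedding column to the model name and vector size listed in this README. The `MODELS` table and the `embed` helper are hypothetical and are not part of this library. The exact preprocessing used to build the collections is not documented here, and whether these older model names still load depends on the installed sentence_transformers version.

```python
# Model names and embedding sizes as listed in this README.
MODELS = {
    "BERT": ("bert-large-nli-stsb-mean-tokens", 1024),
    "DistilBERT": ("distilbert-base-nli-stsb-mean-tokens", 768),
    "RoBERTa": ("roberta-large-nli-stsb-mean-tokens", 1024),
    "Multilingual DistilBERT": ("distiluse-base-multilingual-cased", 512),
}


def embed(texts, column="DistilBERT"):
    """Encode a list of strings with the model behind the given column name."""
    # Imported lazily so MODELS can be inspected without the library installed.
    from sentence_transformers import SentenceTransformer

    model_name, expected_dim = MODELS[column]
    model = SentenceTransformer(model_name)  # fetches the model on first use
    vectors = model.encode(texts)            # array of shape (len(texts), dim)
    if vectors.shape[1] != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {vectors.shape[1]}"
        )
    return vectors
```

A call such as `embed(["some news text"], column="BERT")` would return one 1024-dimensional vector per input text, provided the model can be loaded.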