WebCredibility

Provides web credibility models (Likert scale) to assign a trustworthiness score to a given website.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Automated Web Credibility

This project provides the data and models described in the paper:

"Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web", Esteves et al., 2018

@inproceedings{fever2018_fake_news,
  author = {Esteves, Diego and Reddy, Aniketh Janardhan and Chawla, Piyush and Lehmann, Jens},
  booktitle = {Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) - EMNLP 2018},
  pages = {50--59},
  title = {Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web},
  url = {http://jens-lehmann.org/files/2018/fever_fake_news.pdf},
  year = 2018
}

Module: trustworthiness

0. Configurations

  • definitions.py: update the local paths here.

1. Pre-processing

preprocessing/

  • fix_dataset_microsoft.py fixes the original Microsoft Credibility dataset.

  • openpg.py exports OpenPageRank data for a given set of URLs (datasets).

2. Feature Extraction

2.1 feature_extractor.py extracts and caches the features for all URLs in a given dataset (multithreaded). It creates one feature file (*.pkl) per URL, plus a single final file (features.complex.all.X.pkl) that merges all of them.

- folder: experiment's folder
- dataset: dataset
- export_html_tags: saves the HTML code locally.
- force: forces reprocessing, even if the file already exists.
- outputs:
    - /out/[expX]/[dataset]/features/
        - ok/ -> feature files (one .pkl per URL)
        - error/ -> extraction errors (one file per URL)
        - html/ -> HTML content of each successfully processed URL
        - features.complex.all.X.pkl (a single file containing all features (text and html2seq) + y + hash, for all URLs)
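
The caching and merging behaviour described above can be sketched as follows. This is a minimal illustration, not the code of feature_extractor.py: the function names, the SHA-256 file naming, the placeholder extractor, the record layout, and the reading of X as the number of URLs in the dataset are all assumptions. The HTML export is left out.

```python
import hashlib
import pickle
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def url_hash(url: str) -> str:
    """Stable file name for a URL (hypothetical naming scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def extract_features(url: str) -> dict:
    """Placeholder extractor; the real features live in features_core.py."""
    return {"url_length": len(url), "uses_https": url.startswith("https://")}


def process_url(url: str, label: int, out_dir: Path, force: bool = False) -> Path:
    """Extract and cache the features of one URL; reuse the cache unless force is set."""
    target = out_dir / "ok" / f"{url_hash(url)}.pkl"
    if target.exists() and not force:
        return target
    try:
        record = {"features": extract_features(url), "y": label, "hash": url_hash(url)}
    except Exception as exc:  # record the failure in error/ instead of aborting the run
        err = out_dir / "error" / f"{url_hash(url)}.txt"
        err.parent.mkdir(parents=True, exist_ok=True)
        err.write_text(f"{url}\n{exc!r}\n")
        return err
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wb") as fh:
        pickle.dump(record, fh)
    return target


def run(dataset: list, out_dir: Path, force: bool = False, workers: int = 4) -> Path:
    """Process (url, label) pairs in parallel, then merge every ok/ file into one pickle."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda item: process_url(item[0], item[1], out_dir, force), dataset))
    merged = []
    for path in sorted((out_dir / "ok").glob("*.pkl")):
        with path.open("rb") as fh:
            merged.append(pickle.load(fh))
    final = out_dir / f"features.complex.all.{len(dataset)}.pkl"  # X assumed = dataset size
    with final.open("wb") as fh:
        pickle.dump(merged, fh)
    return final
```

Writing one file per URL means an interrupted run can resume without repeating finished URLs, which is presumably why the force flag exists.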

2.2 features_split.py splits the feature file (features.complex.all.X.pkl) of a given dataset into a set of feature groups, converting the features from a JSON-like format to an np.array ready to be used for training.

- folder: experiment's folder
- dataset: dataset
- outputs: (K=number of ok/ files, where K<=X)
    - /out/[expX]/[dataset]/features/
        1. features.split.basic.K.pkl
        2. features.split.basic_gi.K.pkl
        3. features.split.all.K.pkl (*)
        4. features.split.all+html2seq.K.pkl
        5. features.split.html2seq.K.pkl (*)
        6. features.split.all+html2seq_pad.K.pkl (*)
            >> linguistic features + padded HTML sequence (based on the best HTML model)

(*) currently the most relevant ones; the others are kept to facilitate further experiments.
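
As a rough illustration of the split, the sketch below turns JSON-like records into row matrices for a few of the groups and pads the HTML sequence to a fixed length. The record keys ("text", "html2seq", "y"), the function names, and the choice of which keys count as "basic" are assumptions; it uses plain lists to stay dependency-free, and wrapping each matrix in np.asarray would give the array form mentioned above.

```python
def pad_sequence(seq, length, pad_value=0):
    """Truncate or right-pad an integer-encoded HTML tag sequence to a fixed length."""
    return list(seq[:length]) + [pad_value] * max(0, length - len(seq))


def split_features(records, basic_keys, max_len):
    """Build per-group feature matrices (lists of rows) and the label vector."""
    all_keys = sorted(records[0]["text"])  # same column order for every row
    groups = {"basic": [], "all": [], "html2seq": [], "all+html2seq_pad": []}
    y = []
    for rec in records:
        text = rec["text"]
        all_row = [float(text[k]) for k in all_keys]
        groups["basic"].append([float(text[k]) for k in basic_keys])
        groups["all"].append(all_row)
        groups["html2seq"].append(list(rec["html2seq"]))  # variable length, unpadded
        groups["all+html2seq_pad"].append(all_row + pad_sequence(rec["html2seq"], max_len))
        y.append(rec["y"])
    return groups, y
```

Padding is what makes the combined group rectangular: every row of all+html2seq_pad has len(all_keys) + max_len columns, whereas the raw html2seq rows differ in length.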

2.3 features_core.py implements all the features

3. Run

classifiers/

  • benchmark.py obtains the results and saves the models.

4. FactBench Eval

  • factbench.py extracts the features and uses a trained model to make predictions for each URL in the FactBench2012_Credibility dataset. This dataset is built from the URLs that DeFacto outputs when run over the positive and negative data of the FactBench dataset.
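
The extract-then-predict loop can be sketched as below. This is an assumption-laden illustration rather than the logic of factbench.py: ThresholdModel stands in for a trained classifier (any object with a scikit-learn style predict(rows) method would do), and toy_features and predict_urls are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Sequence


class ThresholdModel:
    """Stand-in for a trained classifier (hypothetical, for illustration only)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, rows: Sequence[Sequence[float]]) -> List[int]:
        # Label a row 1 when its first feature reaches the threshold, else 0.
        return [1 if row[0] >= self.threshold else 0 for row in rows]


def toy_features(url: str) -> List[float]:
    """Placeholder extractor; the real features come from features_core.py."""
    return [1.0 if url.startswith("https://") else 0.0, float(len(url))]


def predict_urls(model, urls: Iterable[str],
                 extract: Callable[[str], List[float]]) -> Dict[str, int]:
    """Extract features for each URL and map every URL to the model's prediction."""
    urls = list(urls)
    rows = [extract(u) for u in urls]
    return dict(zip(urls, model.predict(rows)))
```

In practice the model would be one of those saved by benchmark.py and loaded from disk before the call.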

Release Notes

version 1.0

currently supports the following datasets:

  • Microsoft
  • C3 Corpus

notes

  • the bundled coffeeandnoodles package should later be replaced by its pip-installed version.