Fake News Detection

Implementing a fake news detector. Comparing different ML algorithms and NLP strategies.

Requirements

Python 3, at least Python 3.5.2
Python 3 package manager pip3
Tested on Ubuntu 16.04

Running experiment with submission data set

In hyperparameter_optimization/randomized_search.py, line 44, set the variable n_iter_search to a value of your choice. It determines how many parameter combinations are tested in the parameter search: the lower the value, the faster the script runs, but the fewer combinations are tested.

Then, execute the script ./run-project.sh. A dataset for the submission has been generated by the preprocessing script and placed in the corresponding output folder. It is not necessary to change anything here!

Running experiment step by step

Setup Environment

Python 3 is required for this project. The version used in development is Python 3.6.4.
Run pip3 install -r requirements.txt in the setup folder. This installs the required Python libraries.

Get the data

Download the data. The first link provides 13,000 fake news articles; the second provides 1 million genuine articles, of which a random sample of 52,000 articles is used in the experiment. Fake news articles therefore make up 20% of the whole dataset (13,000 of 65,000 articles).

Fake News dataset (the file must be unzipped)
The Signal Media One-Million News Articles Dataset (the file must be unzipped)
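
The class balance described above can be checked with a few lines of Python. This is only a sketch of the arithmetic and of drawing a random sample by index. The variable names and the seed are assumptions for illustration, not taken from the preprocessing scripts.

```python
import random

# Figures from the description above.
n_fake = 13000
n_genuine_available = 1000000
n_genuine_sampled = 52000

# Draw a reproducible random sample of article indices (the seed is an assumption).
rng = random.Random(42)
sampled_ids = rng.sample(range(n_genuine_available), n_genuine_sampled)

# Share of fake news articles in the combined dataset.
fake_share = n_fake / (n_fake + len(sampled_ids))
print("{:.0%}".format(fake_share))
```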

Preprocessing

Both downloaded files must be placed under their original names in the preprocessing/raw_data directory.

For the experiment, run python3 generate_dataset.py. This generates a dataset with the above-mentioned characteristics.
To reproduce the submission dataset, run python3 generate_submission_dataset.py. This generates a dataset with the above-mentioned characteristics and 1,500 lines.

Feature Generation

We distinguish two vectorizers:

hash vectorizer: python3 hashing_vectorizer.py -l <lower bound> -u <upper bound>
ngram vectorizer: python3 ngram_vectorizer.py -l <lower bound> -u <upper bound> [-t]

Parameter	Values
`lower_bound`	tested with `1..3`
`upper_bound`	tested with `1..3`
`-t`	use term frequency–inverse document frequency (only for ngrams)

In the experiment, lower_bound and upper_bound were set equal. The implementations are based on sklearn's implementation of the HashingVectorizer and CountVectorizer.
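
As a rough illustration of what the two vectorizers do, the sketch below applies sklearn's HashingVectorizer and CountVectorizer to two toy sentences with equal lower and upper bounds. The example documents, the n_features value, and the use of TfidfTransformer for the -t option are assumptions. The project scripts may be wired differently.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
)

# Toy documents, for illustration only.
docs = [
    "breaking news about the election",
    "the election results are in",
]

# Equal bounds, as in the experiment: here only bigrams are extracted.
lower_bound = upper_bound = 2

# Hashing vectorizer: stateless, maps n-grams to a fixed number of columns.
hashing = HashingVectorizer(
    ngram_range=(lower_bound, upper_bound),
    n_features=2 ** 10,  # assumed size, chosen small for the example
)
hashed = hashing.transform(docs)

# N-gram vectorizer: learns a vocabulary and counts occurrences.
counter = CountVectorizer(ngram_range=(lower_bound, upper_bound))
counts = counter.fit_transform(docs)

# Optional tf-idf weighting on top of the counts (assumed equivalent of -t).
tfidf = TfidfTransformer().fit_transform(counts)

print(hashed.shape, counts.shape, tfidf.shape)
```

The hashing variant needs no fitting, while the n-gram variant stores a vocabulary that has to be kept if other data is to be mapped to the same columns later.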

To vectorize the training set and save the fitted vectorizers so that the feature vectors for the test set can be regenerated later, use hashing_vectorizer_training.py and ngram_vectorizer_training.py instead, with the same parameters.
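
The idea of keeping a fitted vectorizer for later reuse can be sketched as follows. How the training scripts actually store the vectorizers is not stated here, so the use of pickle and the toy documents are assumptions.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

# Toy data, for illustration only.
train_docs = ["fake story spreads fast", "real story checked twice"]
test_docs = ["another fake story"]

# Fit on the training set only; this fixes the vocabulary (the columns).
vectorizer = CountVectorizer(ngram_range=(1, 1))
train_vectors = vectorizer.fit_transform(train_docs)

# Serialize the fitted vectorizer (a script would write these bytes to a file).
blob = pickle.dumps(vectorizer)

# Later: restore it and transform the test set without refitting,
# so that test vectors share the training columns.
restored = pickle.loads(blob)
test_vectors = restored.transform(test_docs)

print(train_vectors.shape, test_vectors.shape)
```

Words that were not seen during fitting, such as "another" in this example, are ignored when the test set is transformed.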

Hyperparameter Optimization

To optimize the parameters of the machine learning algorithms in the experiment, run the hyperparameter optimization script. For this, all features for the validation set must already have been generated as explained above! In line 44 of the script, you can set the number of configurations to be tested. A higher number may give better results, but also takes longer.
Then, run python3 randomized_search.py.
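
The effect of n_iter_search can be illustrated with sklearn's RandomizedSearchCV. This is a minimal sketch assuming the script wraps that class, which is not confirmed here. The classifier, the synthetic data, and the parameter grid are placeholders for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the vectorized validation set.
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Number of sampled configurations: lower is faster, higher is more thorough.
n_iter_search = 3

# Placeholder search space (10 possible combinations).
param_distributions = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=n_iter_search,
    scoring="f1",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Only n_iter_search of the possible combinations are evaluated, which is why a small value finishes quickly but may miss the best configuration.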

Feature Regeneration

This step creates the vectors for the test set that is used in the final assessment of the algorithms. It only works if the training set has already been vectorized.

Then, run both python3 hashing_revectorizer.py and python3 ngram_revectorizer.py with the same parameter settings that were used in the Feature Generation step.

Classifier Performance

The classifier performance is evaluated with the F1 score. The optimized parameters that were obtained in the hyperparameter optimization step must be configured in code. Run experiment.py with the vectorized training and test sets ready in the data folders in feature_generation and feature_regeneration respectively; this should have happened automatically in the previous steps. The output is printed directly on the console.
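
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for a binary problem, assuming that fake news is the positive class (label 1). The function name and the toy labels are hypothetical and not taken from experiment.py.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Compute the F1 score for one positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = fake, 0 = genuine (assumed encoding).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

score = f1_score_binary(y_true, y_pred)
print(round(score, 3))
```

Because the dataset is imbalanced (20% fake news), the F1 score is more informative than plain accuracy, which a classifier could push to 80% by always predicting "genuine".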

tommartensen/fake-news-detector