This repository contains an improved model for the Fake News Challenge baseline implementation (details below). The competition (test) score improved by ~5% with this model.
The training dataset consists of unique headline and body pairs with a corresponding class label. The testing dataset consists of headline and body pairs without class labels. The goal of the Fake News Challenge (FNC) is to determine whether the article agrees with, disagrees with, discusses, or is unrelated to the headline.
Run the Hierarchical-main.ipynb notebook to train the model and see the test score.
Due to size constraints, the hierarchicalModel-data folder does not contain the required files below:
- glove.6B.50d.txt - can be downloaded from https://nlp.stanford.edu/projects/glove/
- Test_BERT.csv - contains BERT token embeddings for all headline and body pairs in the test data, separated by token
- Train_BERT.csv - contains BERT token embeddings for all headline and body pairs in the train data, separated by token

Both Test_BERT.csv and Train_BERT.csv are generated using https://bert-as-service.readthedocs.io/en/latest/section/what-is-it.html
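The snippet below is a minimal sketch of how such embedding CSVs could be regenerated with bert-as-service, assuming a bert-serving-server is already running; the input file name and column names (Headline, articleBody) are placeholders, not the repository's actual scripts.

```python
# Sketch: regenerate a BERT embedding CSV with bert-as-service.
# Assumes `bert-serving-start` is already running; file and column names
# below are placeholders.
import pandas as pd
from bert_serving.client import BertClient

bc = BertClient()  # connects to the running bert-serving-server

pairs = pd.read_csv("test_pairs.csv")  # placeholder input file
texts = (pairs["Headline"] + " ||| " + pairs["articleBody"]).tolist()

embeddings = bc.encode(texts)  # one fixed-size vector per headline/body pair
pd.DataFrame(embeddings).to_csv("Test_BERT.csv", index=False)
```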
Feature Engineering
- On top of the baseline features, which included n-gram overlap, topic modelling was used and the cosine similarity between topic-document vectors was computed.

Classifier 1
- Binary classification was performed between the related and unrelated classes using the new features + an XGBoost classifier (a rough sketch of this stage appears after this list).

Classifier 2
- Samples labelled unrelated by Classifier 1 keep that label; samples labelled related are passed on to Classifier 2, which consists of an ensemble of 3 DNNs with BERT embeddings as features and an XGBoost classifier, training for the 4 classes.
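The snippet below is a minimal sketch of the topic-similarity feature and Classifier 1, not the exact notebook code: scikit-learn's LDA stands in for whichever topic model was used, and all names (X_baseline, y_related, hyperparameters) are illustrative.

```python
# Sketch: topic-vector cosine similarity feature + binary (related/unrelated)
# XGBoost classifier. Library choices and names are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from xgboost import XGBClassifier

def topic_similarity_features(headlines, bodies, n_topics=20):
    """Cosine similarity between the topic vectors of each headline/body pair."""
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(headlines + bodies)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topics = lda.fit_transform(counts)
    head_topics, body_topics = topics[:len(headlines)], topics[len(headlines):]
    return np.array([cosine_similarity(h.reshape(1, -1), b.reshape(1, -1))[0, 0]
                     for h, b in zip(head_topics, body_topics)])

def train_classifier1(X_baseline, topic_sims, y_related):
    """Classifier 1: related vs. unrelated on baseline features + the new feature."""
    X = np.column_stack([X_baseline, topic_sims])
    clf = XGBClassifier(n_estimators=200, max_depth=4)
    clf.fit(X, y_related)  # y_related: 1 = related, 0 = unrelated
    return clf
```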
Below are the details of the baseline implementation that we modified to improve performance:
Information about the fake news challenge can be found on FakeChallenge.org.
This repository contains code that reads the dataset, extracts some simple features, trains a cross-validated model and performs an evaluation on a hold-out set of data.
Credit:
- Byron Galbraith (Github: @bgalbraith, Slack: @byron)
- Humza Iqbal (GitHub: @humzaiqbal, Slack: @humza)
- HJ van Veen (GitHub/Slack: @mlwave)
- Delip Rao (GitHub: @delip, Slack: @dr)
- James Thorne (GitHub/Slack: @j6mes)
- Yuxi Pan (GitHub: @yuxip, Slack: @yuxipan)
Please raise questions in the slack group fakenewschallenge.slack.com
The FNC dataset is included as a submodule. You should download the fnc-1 dataset by running the following commands. This places the fnc-1 dataset into the folder fnc-1/
git submodule init
git submodule update
The dataset class reads the FNC-1 dataset and loads the stances and article bodies into two separate containers.
dataset = DataSet()
You can access these through the .stances and .articles variables:
print("Total stances: " + str(len(dataset.stances)))
print("Total article bodies: " + str(len(dataset.articles)))
.articles is a dictionary of articles, indexed by body ID. For example, the text of the article with body ID 144 can be printed with the following command:
print(dataset.articles[144])
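Assuming the baseline's DataSet loads each stance row as a dictionary keyed by the FNC-1 CSV columns, an individual example can be inspected like this:

```python
# Each stance entry is assumed to be a dict keyed by the FNC-1 CSV columns.
stance = dataset.stances[0]
print(stance['Headline'], stance['Body ID'], stance['Stance'])
```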
Data is split using the generate_hold_out_split() function. This function ensures that article bodies in the training set are not present in the hold-out set. It accepts the following arguments and writes the body IDs to disk; a usage sketch follows the list.
- dataset - a dataset class that contains the articles and bodies
- training=0.8 - the percentage of data used for the training set (1-training is used for the hold-out set)
- base_dir="splits/" - the directory in which the IDs are written to disk
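For example, the split could be generated as follows; the import paths are assumptions based on the baseline repository layout.

```python
# Usage sketch; the import paths are assumed from the baseline repo layout.
from utils.dataset import DataSet
from utils.generate_test_splits import generate_hold_out_split

dataset = DataSet()
# Writes the training / hold-out body IDs under splits/
generate_hold_out_split(dataset, training=0.8, base_dir="splits/")
```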
The training set is split into k folds using the kfold_split function. This reads the hold-out/training split from disk and generates it if the split is not present. It takes the following arguments; a usage sketch is given after the function descriptions below.
- dataset - dataset reader
- training=0.8 - passed to the hold-out split generation function
- n_folds=10 - number of folds
- base_dir="splits" - directory to read dataset splits from or write them to
This returns 2 items: an array of arrays containing the stance IDs for each fold, and an array containing the hold-out stance IDs.
The get_stances_for_folds function returns the stances from the original dataset. See fnc_kfold.py for example usage.
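Putting the two functions together, the folds can be generated and consumed roughly as follows; the import path and return shapes are assumptions based on the baseline layout (see fnc_kfold.py for the real usage).

```python
# Sketch of generating folds and retrieving their stances; import path and
# return shapes assumed from the baseline layout (see fnc_kfold.py).
from utils.generate_test_splits import kfold_split, get_stances_for_folds

fold_ids, hold_out_ids = kfold_split(dataset, training=0.8, n_folds=10, base_dir="splits")
fold_stances, hold_out_stances = get_stances_for_folds(dataset, fold_ids, hold_out_ids)

print("Number of folds:", len(fold_stances))
print("Hold-out stances:", len(hold_out_stances))
```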
The report_score function in utils/score.py is based on the original scorer provided in the FNC-1 dataset repository, written by @bgalbraith.
report_score expects 2 parameters: a list of actual stances (i.e. from the dev dataset) and a list of predicted stances (i.e. what your classifier predicts on the dev dataset). In addition to computing the score, it will also print the score as a percentage of the maximum score attainable on that gold-standard data (such as a fold or the hold-out set).
predicted = ['unrelated','discuss',...]
actual = [stance['Stance'] for stance in holdout_stances]
report_score(actual, predicted)
This will print a confusion matrix and a final score for your classifier. We provide the scores for a classifier with a simple set of features, which you should be able to match and eventually beat!
|           | agree | disagree | discuss | unrelated |
|-----------|-------|----------|---------|-----------|
| agree     | 173   | 10       | 1435    | 28        |
| disagree  | 39    | 7        | 413     | 238       |
| discuss   | 221   | 7        | 3556    | 680       |
| unrelated | 10    | 3        | 358     | 17978     |

Score: 8761.75 out of 11651.25 (75.20%)
|           | agree | disagree | discuss | unrelated |
|-----------|-------|----------|---------|-----------|
| agree     | 118   | 3        | 556     | 85        |
| disagree  | 14    | 3        | 130     | 15        |
| discuss   | 58    | 5        | 1527    | 210       |
| unrelated | 5     | 1        | 98      | 6794      |

Score: 3538.0 out of 4448.5 (79.53%)
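For reference, report_score builds on the FNC-1 weighted scoring rule: 0.25 points for getting the related/unrelated distinction right, plus 0.75 points for the correct agree/disagree/discuss label on related pairs. Below is a minimal sketch of that rule (an equivalent reformulation, not the official scorer in utils/score.py).

```python
# Minimal sketch of the FNC-1 weighted scoring rule: 0.25 for the correct
# related/unrelated decision, plus 0.75 for the correct related label.
RELATED = {'agree', 'disagree', 'discuss'}

def fnc_score(actual, predicted):
    score = 0.0
    for a, p in zip(actual, predicted):
        if (a in RELATED) == (p in RELATED):
            score += 0.25            # correct related/unrelated decision
            if a in RELATED and a == p:
                score += 0.75        # correct fine-grained stance
    return score

def max_fnc_score(actual):
    # Maximum possible score: every label predicted correctly.
    return fnc_score(actual, actual)
```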