/UCL_Information-Retrival-Data-Mining_Stance-Detection

UCL Information Retrieval & Data Mining project. Work on the "Stance Detection" problem, part of the "Fake News Challenge" (FNC).


COMPGI15: Information Retrieval and Data Mining - Stance Detection

This repository contains the source code and report for the "News Stance Detection" project of the "COMPGI15: IRDM" course of UCL's MSc in Business Analytics (academic year 2017-2018).

Task Description

In the context of news, a claim is made in a news headline, as well as in the text of an article body. Quite often, the headline of a news article is written to attract readers, even though the body of the article may be about a different subject or make a different claim than the headline. Stance detection involves estimating the relative perspective (or stance) of two pieces of text, i.e. whether the two pieces agree, disagree, discuss or are unrelated to one another. Our task in this project is to estimate the stance of a news article's body text relative to its headline, i.e. to detect whether the headline and the body make the same claim. The stance is categorized as one of four labels: “agree”, “disagree”, “discuss” and “unrelated”.

Data

We use the publicly available FNC-1 dataset, which is divided into a training set and a testing set at a ratio of roughly 2:1. Every data sample is a pair of a headline and a body. The dataset is severely imbalanced: “unrelated” pairs form the majority (over 70%) of both sets, while “disagree” accounts for less than 3%, and “agree” and “discuss” for less than 20% and 10%, respectively. We use the official FNC-1 baseline to compare our results against.
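
For reference, a minimal sketch of how the training portion might be loaded and its class distribution inspected, assuming the standard train_stances.csv and train_bodies.csv files from the official FNC-1 release, joined on "Body ID":

```python
import pandas as pd

# Assumed file and column names from the official FNC-1 release:
#   train_stances.csv -> Headline, Body ID, Stance
#   train_bodies.csv  -> Body ID, articleBody
stances = pd.read_csv("train_stances.csv")
bodies = pd.read_csv("train_bodies.csv")

# Join each headline/stance pair with its article body.
train = stances.merge(bodies, on="Body ID")

# Inspect the class imbalance described above.
print(train["Stance"].value_counts(normalize=True))
```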

Procedure

Bag-of-Words and TF-IDF models are implemented from scratch to extract vector representations of headlines and bodies, and gensim's Word2Vec implementation is used as well. After extensive experimentation with feature engineering, Logistic Regression (also implemented from scratch) is tested. Finally, "off-the-shelf" libraries (e.g. sklearn's Random Forest, XGBoost and Multilayer Perceptron) are used to improve performance.
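
As an illustration of one such feature, the sketch below computes the TF-IDF cosine similarity between each headline and its body. It uses sklearn's TfidfVectorizer for brevity, whereas the project implements BoW and TF-IDF from scratch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_cosine_similarity(headlines, bodies):
    """Cosine similarity between TF-IDF vectors of paired headlines and bodies."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit a shared vocabulary so headline and body vectors live in the same space.
    vectorizer.fit(list(headlines) + list(bodies))
    H = vectorizer.transform(headlines)
    B = vectorizer.transform(bodies)
    # Row-wise cosine similarity between the two sparse matrices.
    dot = np.asarray(H.multiply(B).sum(axis=1)).ravel()
    norms = (np.sqrt(np.asarray(H.multiply(H).sum(axis=1)).ravel()) *
             np.sqrt(np.asarray(B.multiply(B).sum(axis=1)).ravel()))
    return dot / np.maximum(norms, 1e-12)
```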

Results

The best-performing classifier was an XGBoost model with 500 estimators, using the following features: BoW and TF-IDF cosine similarity between headline and body, TF-IDF Euclidean distance between headline and body, unigram and bigram overlap ratios, co-occurrence, and counts of refuting and discuss words in the body. Results are presented below for both the validation and test sets, scored with the official FNC-1 evaluation metric. The model performs 0.65% better than the official baseline on the test set. A Random Forest model was 1.33% better than the baseline but struggled with the “disagree” class (0.56% F1-score). Results for the XGBoost model are presented below:
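
A minimal sketch of how such a model might be fit, assuming the hand-crafted features above have already been assembled into a matrix X (one row per headline/body pair) with integer-encoded stance labels y; hyperparameters other than the 500 estimators are not documented here:

```python
from xgboost import XGBClassifier

def train_stance_classifier(X, y):
    """Fit the gradient-boosted stance classifier on the hand-crafted features.

    X: feature matrix (cosine similarities, Euclidean distance, overlap ratios,
       co-occurrence, refuting/discuss word counts), one row per headline/body pair.
    y: stance labels encoded as integers (e.g. 0=agree, 1=disagree, 2=discuss, 3=unrelated).
    """
    model = XGBClassifier(n_estimators=500)
    model.fit(X, y)
    return model
```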

Validation Set

            Agree   Disagree   Discuss   Unrelated
Agree         107         25       215          21
Disagree       16         21        41           6
Discuss        72         33       720          66
Unrelated       4          2        56        3593
Score: 1846.75 out of 2256.75 (81.83%)

Test Set

            Agree   Disagree   Discuss   Unrelated
Agree         425         60      1281         137
Disagree      116         43       407         131
Discuss       754         81      3220         409
Unrelated      50         26       373       17900
Score: 8837.75 out of 11651.25 (75.85%)
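
The scores above use the official FNC-1 relative scoring scheme, which awards 0.25 points for correctly separating related from unrelated pairs and a further 0.75 points for predicting the exact stance of a related pair; the percentages are the score divided by the maximum attainable score. A minimal re-implementation of that weighting (the official scorer is distributed with the FNC-1 baseline):

```python
RELATED = {"agree", "disagree", "discuss"}

def fnc_score(gold, predicted):
    """FNC-1 weighted score: 0.25 for getting related vs. unrelated right,
    plus 0.75 more for the exact stance on related pairs."""
    score = 0.0
    for g, p in zip(gold, predicted):
        if (g in RELATED) == (p in RELATED):
            score += 0.25
        if g in RELATED and g == p:
            score += 0.75
    return score

def max_fnc_score(gold):
    """Best achievable score, i.e. the denominator of the percentages above."""
    return sum(1.0 if g in RELATED else 0.25 for g in gold)
```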