/UCL_Information-Retrival-Data-Mining_Stance-Detection

UCL Information Retrieval & Data Mining project. Work on the "Stance Detection" problem, part of the "Fake News Challenge" (FNC).


COMPGI15: Information Retrieval and Data Mining - Stance Detection

This repository contains the source code and report for the "News Stance Detection" project of the "COMPGI15: IRDM" course of UCL's MSc in Business Analytics (academic year 2017-2018).

Task Description

In the context of news, a claim is made in a news headline, as well as in the text of an article body. Quite often, the headline of a news article is written to attract readers, even though the body of the article may be about a different subject or make a different claim than the headline. Stance detection involves estimating the relative perspective (or stance) of two pieces of text, i.e. whether the two pieces agree, disagree, discuss or are unrelated to one another. Our task in this project is to estimate the stance of a news article's body text relative to its headline, i.e. to detect whether the headline and the body make the same claim. The stance is categorized as one of four labels: “agree”, “disagree”, “discuss” and “unrelated”.

Data

We use the publicly available FNC-1 dataset, which is divided into a training set and a testing set at a ratio of roughly 2:1. Every data sample is a pair of a headline and a body. The dataset is severely imbalanced: “unrelated” pairs form the majority (over 70%) of both sets, while “disagree” accounts for less than 3%, and “agree” and “discuss” for less than 20% and 10%, respectively. We use the official FNC-1 baseline to compare our results against.
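
For reference, a minimal sketch of how the training portion might be loaded and its class distribution inspected, assuming the standard train_stances.csv and train_bodies.csv files from the official FNC-1 release, joined on "Body ID":

```python
import pandas as pd

# Assumed file and column names from the official FNC-1 release:
#   train_stances.csv -> Headline, Body ID, Stance
#   train_bodies.csv  -> Body ID, articleBody
stances = pd.read_csv("train_stances.csv")
bodies = pd.read_csv("train_bodies.csv")

# Join each headline/stance pair with its article body.
train = stances.merge(bodies, on="Body ID")

# Inspect the class imbalance described above.
print(train["Stance"].value_counts(normalize=True))
```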

Procedure

Bag-of-Words and TF-IDF models are implemented from scratch to extract vector representations of headlines and bodies, and gensim's Word2Vec implementation is used as well. After extensive experimentation with feature engineering, Logistic Regression (also implemented from scratch) is tested. Finally, "off-the-shelf" libraries (e.g. sklearn's Random Forest, XGBoost and Multilayer Perceptron) are used to improve performance.
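
As an illustration of one such feature, the sketch below computes the TF-IDF cosine similarity between each headline and its body. It uses sklearn's TfidfVectorizer for brevity, whereas the project implements BoW and TF-IDF from scratch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_cosine_similarity(headlines, bodies):
    """Cosine similarity between TF-IDF vectors of paired headlines and bodies."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit a shared vocabulary so headline and body vectors live in the same space.
    vectorizer.fit(list(headlines) + list(bodies))
    H = vectorizer.transform(headlines)
    B = vectorizer.transform(bodies)
    # Row-wise cosine similarity between the two sparse matrices.
    dot = np.asarray(H.multiply(B).sum(axis=1)).ravel()
    norms = (np.sqrt(np.asarray(H.multiply(H).sum(axis=1)).ravel()) *
             np.sqrt(np.asarray(B.multiply(B).sum(axis=1)).ravel()))
    return dot / np.maximum(norms, 1e-12)
```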

Results

The best-performing classifier was an XGBoost model with 500 estimators, using the following features: BoW and TF-IDF cosine similarity between headline and body, TF-IDF Euclidean distance between headline and body, unigram and bigram overlap ratios, co-occurrence, and counts of refuting and discuss words in the body. Results are presented below for both the validation and test sets, scored with the official FNC-1 evaluation metric. The model performs 0.65% better than the official baseline on the test set. A Random Forest model was 1.33% better than the baseline but struggled with the “disagree” class (0.56% F1-score). Results for the XGBoost model are presented below:
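
A minimal sketch of how such a model might be fit, assuming the hand-crafted features above have already been assembled into a matrix X (one row per headline/body pair) with integer-encoded stance labels y; hyperparameters other than the 500 estimators are not documented here:

```python
from xgboost import XGBClassifier

def train_stance_classifier(X, y):
    """Fit the gradient-boosted stance classifier on the hand-crafted features.

    X: feature matrix (cosine similarities, Euclidean distance, overlap ratios,
       co-occurrence, refuting/discuss word counts), one row per headline/body pair.
    y: stance labels encoded as integers (e.g. 0=agree, 1=disagree, 2=discuss, 3=unrelated).
    """
    model = XGBClassifier(n_estimators=500)
    model.fit(X, y)
    return model
```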

Validation Set

            Agree   Disagree   Discuss   Unrelated
Agree         107         25       215          21
Disagree       16         21        41           6
Discuss        72         33       720          66
Unrelated       4          2        56        3593
Score: 1846.75 out of 2256.75 (81.83%)

Test Set

            Agree   Disagree   Discuss   Unrelated
Agree         425         60      1281         137
Disagree      116         43       407         131
Discuss       754         81      3220         409
Unrelated      50         26       373       17900
Score: 8837.75 out of 11651.25 (75.85%)
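
The scores above use the official FNC-1 relative scoring scheme, which awards 0.25 points for correctly separating related from unrelated pairs and a further 0.75 points for predicting the exact stance of a related pair; the percentages are the score divided by the maximum attainable score. A minimal re-implementation of that weighting (the official scorer is distributed with the FNC-1 baseline):

```python
RELATED = {"agree", "disagree", "discuss"}

def fnc_score(gold, predicted):
    """FNC-1 weighted score: 0.25 for getting related vs. unrelated right,
    plus 0.75 more for the exact stance on related pairs."""
    score = 0.0
    for g, p in zip(gold, predicted):
        if (g in RELATED) == (p in RELATED):
            score += 0.25
        if g in RELATED and g == p:
            score += 0.75
    return score

def max_fnc_score(gold):
    """Best achievable score, i.e. the denominator of the percentages above."""
    return sum(1.0 if g in RELATED else 0.25 for g in gold)
```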