Classification of political intent in tweets

Based on 5978 tweets from Dutch politicians, this LSTM-based classifier tries to identify the intention behind a tweet, choosing from 18 classes. It reaches approximately 47% accuracy.

Participants

prof. Marcel Broersma, principal investigator, University of Groningen, Faculty of Arts
dr. Marc Esteve del Valle, second principal investigator, University of Groningen, Faculty of Arts
MSc Herbert Teun Kruitbosch, data scientist, University of Groningen, Data science team
dr. Erik Tjong Kim Sang, data scientist, eScience Center

(The data science team is a group of 10 data scientists and related specialists who assist researchers from all faculties with data science and scientific programming, as part of the university's Center of Information Technology.)

Data

We have 5978 tweets, annotated in 18 categories:

SHARING FROM OWN NEWS OUTLET
SHARING FROM OTHER OUTLETS
SHARING FROM NON-MEDIA
LIVE REPORTING
SELF PROMOTION
OTHERS PROMOTION
OPINION, CRITIQUE, INTERPRETATION
ARGUING
REQUEST JOURN INPUT
REQUEST NON-JOURN INPUT
RETWEET REQUEST
ADVICE
ACKNOWLEDGEMENT
PERSONAL
ERROR CORRECTION
JOURNALISTIC REFLECTION
OTHER
UNKNOWN

Figure 1 shows how the tweets are distributed over these categories.

Figure 1. Class distribution

The 1000 most recent tweets were used as the test set. Splitting by time rather than at random avoids train-test contamination, because tweets from the same time frame might be similar.
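A time-based split like this can be sketched in a few lines. This is a minimal illustration, not the project's own code: the record layout (a dict with a `created_at` field) and the function name `chronological_split` are assumptions made for the example.

```python
from datetime import datetime


def chronological_split(tweets, test_size):
    """Hold out the `test_size` most recent tweets as the test set.

    Assumes every tweet is a dict with a `created_at` datetime
    (a hypothetical layout, chosen for this example).
    """
    if not 0 < test_size < len(tweets):
        raise ValueError("test_size must be between 1 and len(tweets) - 1")
    ordered = sorted(tweets, key=lambda t: t["created_at"])
    return ordered[:-test_size], ordered[-test_size:]


# Toy records; the real data set is not part of the repository.
tweets = [
    {"text": "c", "created_at": datetime(2018, 3, 1)},
    {"text": "a", "created_at": datetime(2018, 1, 1)},
    {"text": "d", "created_at": datetime(2018, 4, 1)},
    {"text": "b", "created_at": datetime(2018, 2, 1)},
]
train, test = chronological_split(tweets, test_size=1)
```

With the real data, `test_size=1000` would reproduce the split described above, leaving the remaining 4978 tweets for training.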

The data is owned by the principal investigator and is hence not included in this git repository.

Model

We used an ensemble of the three models that performed best in-sample out of 10 trained models. Each model was a character-level LSTM.
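How the three models' predictions are combined is not stated here. One common choice, assumed in the sketch below, is to average the per-class probabilities and take the most likely class. The function names and all numbers are hypothetical and only illustrate the selection and averaging steps.

```python
def select_top_models(scores, k=3):
    """Return the indices of the k models with the highest score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def ensemble_predict(prob_rows):
    """Average per-class probabilities over models and return (label, mean).

    Averaging is an assumption; voting or weighting are alternatives.
    """
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    mean = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: mean[c])
    return label, mean


# Made-up in-sample accuracies for five models (the project trained 10).
scores = [0.41, 0.52, 0.47, 0.55, 0.39]
chosen = select_top_models(scores, k=3)

# Made-up class probabilities from each model for one tweet (3 classes).
all_probs = [
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.5, 0.4, 0.1],
    [0.3, 0.3, 0.4],
]
label, mean_probs = ensemble_predict([all_probs[i] for i in chosen])
```

Selecting models by in-sample score, as described above, can favour models that overfit; selecting on a separate validation split is a common alternative.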

Results

This classifier obtained 47% accuracy on the held-out test set. Figures 2 and 3 show the confusion matrix and the top-n accuracies.

Figure 2. Prediction confusion matrix

Figure 3. Top-n classification accuracy
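Top-n accuracy counts a prediction as correct when the true class is among the n classes the model considers most likely, so top-1 accuracy is the ordinary accuracy. The sketch below shows the computation on made-up probabilities; the function name `top_n_accuracy` is chosen for this example and is not taken from twitterlib.

```python
def top_n_accuracy(probabilities, true_labels, n):
    """Fraction of samples whose true label is among the n most likely classes."""
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Class indices ordered from most to least likely.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if truth in ranked[:n]:
            hits += 1
    return hits / len(true_labels)


# Made-up predictions for four tweets over three classes.
probabilities = [
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
    [0.2, 0.5, 0.3],  # true class 2 is ranked second
    [0.5, 0.4, 0.1],  # true class 2 is ranked third
    [0.1, 0.2, 0.7],  # true class 2 is ranked first
]
true_labels = [0, 2, 2, 2]

top1 = top_n_accuracy(probabilities, true_labels, n=1)
top2 = top_n_accuracy(probabilities, true_labels, n=2)
top3 = top_n_accuracy(probabilities, true_labels, n=3)
```

With 18 classes, the top-n curve shows how often the correct class is at least among the model's first few guesses even when the top prediction is wrong.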

Code

The twitterlib provides:

methods to scrape Twitter by simulating a web browser session: twitterlib.collection.scrape
models to predict meta-information of a tweet using TF-IDF and logistic regression or a convolutional LSTM.

Implementation

We applied our method in Google Colab using this notebook.