tweet-classification

Classification of Tweets relating to flooding events.


Comparing rule-based methods and pre-trained language models to classify flood-related Tweets


Cillian Berragan [@cjberragan]¹*, Alessia Calafiore [@alel_domi]¹

¹ Geographic Data Science Lab, University of Liverpool, Liverpool, United Kingdom

* Correspondence: C.Berragan@liverpool.ac.uk

Abstract

Social media presents a rich source of real-time information provided by individual users in emergency situations. However, due to its unstructured nature and high volume, it is challenging to extract key information from these continuous data streams. This paper compares a deep neural classification model, a transformer, against a simple rule-based classifier on the task of identifying relevant flood-related Tweets. Results show that the classification model outperforms the rule-based approach, at the time cost of labelling data and training the model.

Description

This repository contains the code for building a RoBERTa-based binary text classification model, trained to distinguish relevant from irrelevant flood-related Tweets. Model training uses a labelled corpus of Tweets extracted from within flood zone bounding boxes during past severe flood events in the United Kingdom.
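For orientation, a minimal sketch of such a model, assuming the Hugging Face transformers library (the actual implementation lives in src/pl_module/classifier_model.py and may differ):

import torch
from pytorch_lightning import LightningModule
from transformers import RobertaForSequenceClassification

class FloodClassifier(LightningModule):
    """Illustrative binary relevant/irrelevant Tweet classifier."""

    def __init__(self, lr: float = 2e-5):
        super().__init__()
        self.save_hyperparameters()
        # num_labels=2 adds a binary classification head on top of RoBERTa
        self.model = RobertaForSequenceClassification.from_pretrained(
            "roberta-base", num_labels=2
        )

    def forward(self, input_ids, attention_mask, labels=None):
        return self.model(
            input_ids=input_ids, attention_mask=attention_mask, labels=labels
        )

    def training_step(self, batch, batch_idx):
        # batches come from the LightningDataModule in src/pl_data/
        outputs = self(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)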

Inference over a separate testing corpus is compared against a keyword-based classification method.
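The keyword baseline amounts to flagging a Tweet as relevant whenever it contains a term from a flood vocabulary; a minimal sketch (the keyword list here is illustrative, not the one used in inf.py):

FLOOD_KEYWORDS = {"flood", "flooding", "flooded"}  # illustrative list

def keyword_classify(text: str) -> int:
    """Return 1 (relevant) if any flood keyword appears in the Tweet, else 0."""
    tokens = text.lower().split()
    return int(any(token.strip(".,!?") in FLOOD_KEYWORDS for token in tokens))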

Project layout

src
├── common
│   ├── get_tweets.py  # download tweets to csv through twitter api
│   └── utils.py  # various utility functions
│
├── pl_data
│   ├── csv_dataset.py  # torch dataset for flood data
│   └── datamodule.py  # lightning datamodule
│
├── pl_module
│   └── classifier_model.py  # flood classification model
│
├── run.py  # train model
└── inf.py  # use model checkpoint for inference and compare with keywords

How to run

The labelled Tweet corpus is not distributed with this repository due to the Twitter terms of service. To train on your own data, place a CSV at data/train/labelled.csv containing data and label columns.
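For example, a minimal labelled.csv (rows are illustrative; here label 1 marks a relevant Tweet and 0 an irrelevant one — check the scheme expected by src/pl_data/csv_dataset.py):

data,label
"River burst its banks, road under water near the bridge",1
"My inbox is flooded after the long weekend",0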

The Docker setup currently uses demo_data to demonstrate model training.

Poetry

Install dependencies using Poetry:

poetry install

Train classifier model using the labelled flood Tweets corpus:

poetry run python -m src.run
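Training writes model checkpoints (the Docker run below maps these to ckpts/); a checkpoint can then be reloaded for inference, as src/inf.py does against the testing corpus. A minimal sketch, reusing the illustrative FloodClassifier above and an assumed checkpoint path:

import torch
from transformers import RobertaTokenizerFast

# checkpoint path is an assumption for illustration
model = FloodClassifier.load_from_checkpoint("ckpts/best.ckpt")
model.eval()

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
batch = tokenizer("Severe flooding on the high street", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1).item())  # 1 = relevant, 0 = irrelevant (assumed)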

Docker

With Docker Compose:

docker compose up

Alternatively, build the image from the Dockerfile:

docker build . -t cjber/flood_tweets

Run with GPU and mapped volumes:

docker run --rm --gpus all -v ${PWD}/ckpts:/flood/ckpts -v ${PWD}/csv_logs:/flood/csv_logs cjber/flood_tweets