A Multilingual Stance Detection Pipeline for Political Communication Analysis
Extended Abstract ICA 2022.
This repository contains the models and code accompanying the paper.
ArXiv Link: coming-soon
PDF Link: coming-soon
Poster and Slides: coming-soon
We introduce a pipeline for multilingual stance detection for the analysis of political texts. The main objective of this work is to provide an easy-to-use tool for political communication scholars, to facilitate their substantive research.
- The Prepare Dataset module is in charge of processing the given dataset and converting it into the predetermined format used by the pipeline.
- The Dataset Loader module is in charge of creating the train and test sets, as well as implementing specific sampling techniques and batch processing.
- For example, if the dataset has unbalanced classes, this module has built-in functions to create balanced batches (a minimal sketch of this is shown after this list).
- The Model Loader module receives as input the model-specific parameters, such as the tokenizer, language model, and architecture of the additional layers (in case of fine-tuning).
- The Model Trainer module takes as input the training parameters, such as the number of epochs, loss functions, and validation metrics. This module trains the model for the given number of epochs and saves the best-performing model for later validation, testing, and predictions (see the trainer sketch after this list).
- After training the model, selecting the best hyper-parameters, and testing the predictions, the resulting model can be used to automatically identify stances in the political texts under study.
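As an illustration of the balanced batching mentioned above, below is a minimal sketch using PyTorch's `WeightedRandomSampler`; the toy data and variable names are illustrative and do not reflect the pipeline's actual API.

```python
# Minimal sketch (illustrative, not the pipeline's API): building
# class-balanced batches for an unbalanced dataset with PyTorch.
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy data: 90 examples of class 0 and 10 of class 1 (heavily unbalanced).
features = torch.randn(100, 8)
labels = torch.tensor([0] * 90 + [1] * 10)

# Weight every example by the inverse frequency of its class.
class_counts = Counter(labels.tolist())
weights = torch.tensor([1.0 / class_counts[int(y)] for y in labels])

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=16, sampler=sampler)

for batch_features, batch_labels in loader:
    # Each batch now contains roughly equal numbers of both classes.
    print(batch_labels.tolist())
```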
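In the same spirit, the sketch below shows one way a training loop can keep the best-performing checkpoint across epochs, scored here with macro F1 from scikit-learn; the function name and signature are hypothetical and not the Model Trainer's actual interface.

```python
# Hypothetical helper (not the actual Model Trainer): train for a fixed
# number of epochs and keep the weights with the best validation macro F1.
import copy

import torch
from sklearn.metrics import f1_score

def train_and_keep_best(model, optimizer, loss_fn, train_loader, val_loader,
                        n_epochs, device="cpu"):
    best_f1, best_state = -1.0, None
    for epoch in range(n_epochs):
        model.train()
        for features, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()

        # Validate and remember the best-performing weights.
        model.eval()
        preds, golds = [], []
        with torch.no_grad():
            for features, labels in val_loader:
                logits = model(features.to(device))
                preds.extend(logits.argmax(dim=-1).cpu().tolist())
                golds.extend(labels.tolist())
        epoch_f1 = f1_score(golds, preds, average="macro")
        if epoch_f1 > best_f1:
            best_f1, best_state = epoch_f1, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)  # restore the best checkpoint
    return model, best_f1
```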
Dependency | Version | Installation Command |
---|---|---|
Python | 3.8.5 | conda create --name stance python=3.8 and conda activate stance |
PyTorch, cudatoolkit | 1.5.0 | conda install pytorch==1.5.0 cudatoolkit=10.1 -c pytorch |
Transformers (HuggingFace) | 3.5.0 | pip install transformers==3.5.0 |
Scikit-learn | 0.23.1 | pip install scikit-learn==0.23.1 |
scipy | 1.5.0 | pip install scipy==1.5.0 |
Ekphrasis | 0.5.1 | pip install ekphrasis==0.5.1 |
emoji | 0.6.0 | pip install emoji |
wandb | 0.9.4 | pip install wandb==0.9.4 |
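For convenience, the commands from the table above can be run in sequence (assuming conda and pip are available on your system):

```bash
conda create --name stance python=3.8
conda activate stance
conda install pytorch==1.5.0 cudatoolkit=10.1 -c pytorch
pip install transformers==3.5.0
pip install scikit-learn==0.23.1
pip install scipy==1.5.0
pip install ekphrasis==0.5.1
pip install emoji
pip install wandb==0.9.4
```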
Following is the structure of the codebase, in case you wish to play around with it.
- `train.py`: Model and training loop.
- `bertloader.py`: Common Dataloader for the 6 datasets.
- `params.py`: Argparsing to enable easy experiments.
- `README.md`: This file 🙂
- `.gitignore`: N/A
- `data`: Directory to store all datasets
- `data/wtwt`: Folder for the WT–WT dataset
- `data/wtwt/README.md`: README for setting up the WT–WT dataset
- `data/wtwt/process.py`: Script to set up the WT–WT dataset
- `data/mt`: Folder for the MT dataset
- `data/mt/README.md`: README for setting up the MT dataset
- `data/mt/process.py`: Script to set up the MT dataset
- `data/encryption`: Folder for the Encryption-Debate dataset
- `data/encryption/README.md`: README for setting up the Encryption-Debate dataset
- `data/encryption/process.py`: Script to set up the Encryption-Debate dataset
- `data/19rumoureval`: Folder for the RumourEval 2019 dataset
- `data/19rumoureval/README.md`: README for setting up the RumourEval 2019 dataset
- `data/19rumoureval/process.py`: Script to set up the RumourEval 2019 dataset
- `data/17rumoureval`: Folder for the RumourEval 2017 dataset
- `data/17rumoureval/README.md`: README for setting up the RumourEval 2017 dataset
- `data/17rumoureval/process.py`: Script to set up the RumourEval 2017 dataset
- `data/16semeval`: Folder for the SemEval 2016 dataset
- `data/16semeval/README.md`: README for setting up the SemEval 2016 dataset
- `data/16semeval/process.py`: Script to set up the SemEval 2016 dataset
- Clone this repository.
- Follow the instructions in the Dependencies section above to install the dependencies.
- If you are interested in logging your runs, set up wandb with `wandb login`.
The dataset is already in the data folder. You can also find the pre-processing notebook in the PIm Manifest folder.
After following the above steps, move to the base path of this repository with `cd bias-stance` and recreate the experiments by executing `python3 train.py [ARGS]`, where `[ARGS]` are the following:
Required Args:
- `dataset_name`: The name of the dataset to run the experiment on. Possible values are ["16se", "wtwt", "enc", "17re", "19re", "mt1", "mt2"]. Example usage: `--dataset_name=wtwt`; Type: `str`. This is a required argument.
- `target_merger`: When the dataset is wtwt, this argument is required to specify the target merger. Example usage: `--target_merger=CVS_AET`; Type: `str`. Valid arguments: ['CVS_AET', 'ANTM_CI', 'AET_HUM', 'CI_ESRX', 'DIS_FOX'] or not including the argument.
- `test_mode`: Indicates whether to evaluate on the test set in the run. Example usage: `--test_mode=False`; Type: `str`.
- `bert_type`: A required argument to specify the BERT weights to be loaded. Refer to HuggingFace. Example usage: `--bert_type=bert-base-cased`; Type: `str`.
Optional Args:
- `seed`: The seed for the current run. Example usage: `--seed=1`; Type: `int`.
- `cross_validation_num`: A helper input for cross-validation in the wtwt and enc datasets. Example usage: `--cross_valid_num=1`; Type: `int`.
- `batch_size`: The batch size. Example usage: `--batch_size=16`; Type: `int`.
- `lr`: Learning rate. Example usage: `--lr=1e-5`; Type: `float`.
- `n_epochs`: Number of epochs. Example usage: `--n_epochs=5`; Type: `int`.
- `dummy_run`: Include the `--dummy_run` flag to perform a dummy run with a single training and validation batch.
- `device`: CPU or CUDA. Example usage: `--device=cpu`; Type: `str`.
- `wandb`: Include the `--wandb` flag if you want your runs to be logged to wandb.
- `notarget`: Include the `--notarget` flag if you want the model to be target-oblivious.
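For example, a run on the WT–WT dataset with the CVS_AET merger and bert-base-cased weights might look as follows; the specific argument values here are illustrative, not a prescribed configuration:

```bash
# Illustrative invocation combining the arguments documented above.
python3 train.py --dataset_name=wtwt --target_merger=CVS_AET \
    --bert_type=bert-base-cased --test_mode=False \
    --seed=1 --batch_size=16 --lr=1e-5 --n_epochs=5 --wandb
```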