BERT_propaganda_detection

Propaganda detection using fine-tuned BERT

Sequence classification for propaganda dataset (QCRI)

[Figure: visualization]

(Code for above visualization not included in repo)

Official PyTorch implementation of the paper Fine-Grained Propaganda Detection with Fine-Tuned BERT.

  1. pip install -r requirements.txt
  2. python -m spacy download en

Train

  1. To create train and dev sets from the training data, run: sh tools/split-train.sh
  2. The raw dataset is converted into intermediate pickle files by preprocess.py. Run it to generate both the train and dev files, e.g.:
    python preprocess.py -d [path to articles and labels directory] -o [name of output file] -l
    The -l flag preserves labels if they are included (needed even when labels aren't available).
  3. Run the trainer, for example:

python train.py --expID test_run1 --trainDataset train-train.p --evalDataset train-dev.p --model bert --LR 3e-5 --trainBatch 32 --nEpochs 5 --classType all_class --nLabels 21 --testDataset train-split/tasks-2-3/train-dev/ --train True --lowerCase True &
Here, train-train.p and train-dev.p are the pickle files obtained by running preprocess.py.
  4. The ./exp directory contains the logs and model states for training runs.
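
Long command lines like the one above are easy to mistype, for instance by dropping the space between two flags. The following is a minimal sketch, not part of the repo, that assembles the same train.py invocation from a dictionary. The helper name `build_train_command` is hypothetical; the flags and values are copied from the example command.

```python
import shlex


def build_train_command(options, script="train.py"):
    """Return the argument list for a run, given a dict of flag name -> value.

    Hypothetical helper: the repo itself only documents the raw command line.
    """
    args = ["python", script]
    for flag, value in options.items():
        # Each option becomes two separate arguments, so flags can never fuse.
        args.extend(["--" + flag, str(value)])
    return args


# Flags and values taken from the training example above.
options = {
    "expID": "test_run1",
    "trainDataset": "train-train.p",
    "evalDataset": "train-dev.p",
    "model": "bert",
    "LR": "3e-5",
    "trainBatch": 32,
    "nEpochs": 5,
    "classType": "all_class",
    "nLabels": 21,
    "testDataset": "train-split/tasks-2-3/train-dev/",
    "train": True,
    "lowerCase": True,
}

command = build_train_command(options)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can also be passed directly to `subprocess.run(command)` from the repo root, which avoids shell quoting altogether.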

Evaluation and Testing

A trained model can be tested on a dataset using:

python train.py --expID test_run1 --trainDataset train-train.p --evalDataset train-dev.p --model bert --LR 3e-5 --trainBatch 32 --nEpochs 5 --classType all_class --nLabels 21 --testDataset train-split/tasks-2-3/train-dev/ --lowerCase True --loadModel exp/all_class/test_run1/ &

This uses the model that achieved the best F1 score on the validation set during training. To produce predictions on a test set, first create the binarized pickle file for it, then run the command above. The output is written to the directory containing the model state, labelled pred.[test dir name].
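
To illustrate what "best model based on the F1 score" can mean in practice, here is a small self-contained sketch. It is an assumption about the general technique, not the repo's actual code: the function names, the micro-averaging, and the use of label 0 as the non-propaganda class are all hypothetical choices, and train.py may compute F1 differently.

```python
def micro_f1(gold, pred, negative_label=0):
    """Micro-averaged F1 over all labels except `negative_label`.

    Assumption: label 0 marks "no propaganda"; the repo may use another scheme.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != negative_label and p == g:
            tp += 1
        else:
            if p != negative_label:
                fp += 1  # predicted a technique that is not the gold one
            if g != negative_label:
                fn += 1  # missed or mislabelled a gold technique
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_epoch(gold, predictions_per_epoch):
    """Return (epoch index, F1) of the epoch with the highest validation F1."""
    scores = [micro_f1(gold, pred) for pred in predictions_per_epoch]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy validation labels and per-epoch predictions (made-up data).
gold = [0, 1, 2, 0, 1]
predictions_per_epoch = [
    [0, 1, 0, 0, 0],  # epoch 0: finds one of three labelled items
    [0, 1, 2, 1, 1],  # epoch 1: finds all three, with one false positive
]
epoch, score = best_epoch(gold, predictions_per_epoch)
print(epoch, round(score, 3))
```

In a real training loop the state saved at the winning epoch would be the one later restored through --loadModel.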

Tested on:

QCRI dataset V2 (NLP4IF)
huggingface/pytorch-pretrained-BERT 1.0
Pandas 0.25.3
Spacy 2.0.18
Torch 1.3.1

Python 3.7
CUDA 10.1
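
Since the code was tested against specific library versions, it can help to compare them with the local environment before training. The sketch below is one hedged way to do that. The helper names are hypothetical, the check relies on each module exposing a `__version__` attribute, and pytorch-pretrained-BERT is left out because its importable module name is not given here.

```python
import importlib

# Versions from the "Tested on" list above, keyed by importable module name.
TESTED = {"pandas": "0.25.3", "spacy": "2.0.18", "torch": "1.3.1"}


def installed_version(module_name):
    """Return module.__version__, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def compare_versions(tested, lookup=installed_version):
    """Map each package whose version differs to a (tested, installed) pair."""
    mismatches = {}
    for name, wanted in tested.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `compare_versions(TESTED)` returns an empty dict when the environment matches the tested versions exactly. The comparison is a plain string match, so a build suffix on the installed version is reported as a mismatch.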