Code and data for the ACL 2023 WASSA paper "Towards Detecting Harmful Agendas in News Articles".
The annotated data is in the file newsagendas.jsonl. Each article record has the following fields:
- id: Article id.
- article-title: Title of the article.
- article-contents: Cleaned/formatted article contents.
- annotated-labels: Annotated feature labels.
  - clickbait
  - junkscience
  - hatespeech
  - conspiracytheory
  - propaganda
  - satire
  - negativesentiment
  - neutralsentiment
  - positivesentiment
  - politicalbias
  - calltoaction
- annotated-agenda-score: Annotated agenda score on a scale of 1 to 5, where 1 is clearly benign and 5 is clearly malicious. The value is 'no answer' if the annotator did not assign a score.
- annotated-evidence: Snippets of text highlighted by the annotators as evidence for the feature labels they annotated. These snippets are copied directly from the article and formatted as a dictionary.
- split: The split (dev or test) the article is assigned to, which is needed to replicate the results from the paper. Articles without an agenda score are assigned to the 'full' split.
- weak-label-0: Original source-level label assigned to the article: the first one listed by the FakeNewsCorpus.
- weak-label-1: Original source-level label assigned to the article: the second one listed by the FakeNewsCorpus.
- weak-label-2: Original source-level label assigned to the article: the third one listed by the FakeNewsCorpus.
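
As a rough illustration of how these fields might be consumed, the sketch below parses JSON Lines records, groups them by the split field, and converts the agenda score to a number. It is not part of the released code. The helper names (load_articles, group_by_split, agenda_score) are hypothetical, and the sketch assumes that each line of the file holds one JSON object and that the score is stored either as a number (or numeric string) or as the literal string 'no answer'.

```python
import json
from collections import defaultdict

# Sentinel described in the field list above for articles without a score.
NO_ANSWER = "no answer"


def load_articles(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts.

    Blank lines are skipped. Assumes one JSON object per non-blank line.
    """
    return [json.loads(line) for line in lines if line.strip()]


def group_by_split(articles):
    """Group article records by their 'split' field (e.g. dev, test, full)."""
    groups = defaultdict(list)
    for article in articles:
        groups[article["split"]].append(article)
    return dict(groups)


def agenda_score(article):
    """Return the agenda score as a float, or None if it is 'no answer'.

    Assumption: the stored value is a number or a numeric string.
    """
    raw = article["annotated-agenda-score"]
    if raw == NO_ANSWER:
        return None
    return float(raw)
```

Under these assumptions, the file could be read with `with open("newsagendas.jsonl", encoding="utf-8") as f: articles = load_articles(f)`, after which `group_by_split(articles)` separates the dev, test, and full records.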
The results shown in the paper were generated using Results_Tables.ipynb.
To fine-tune a BERT model to predict the agenda score from the article title and contents, we use the data splits in bert_training_datasets for training with cross-validation. To replicate the results in the paper, fine-tune BERT on these splits by running:
python BERT_model.py
Our BERT/FRESH feature model predictions on NewsAgendas are in the results folder. To retrain the models yourself, use the FRESH_dev directory, which builds on the original FRESH paper's work; see updated_README.md in that directory for more information on our modifications. From the FRESH_dev directory, run the following, replacing {CUDA_DEVICE} and {DATASET_NAME} with your own values:
CUDA_DEVICE={CUDA_DEVICE} \
EPOCHS=50 \
DATASET_NAME={DATASET_NAME} \
CLASSIFIER=bert_classification \
python Rationale_Analysis/experiments/run_for_random_seeds.py \
--script-type fresh/experiment_script.sh \
--defaults-file Rationale_Analysis/default_values/news_b16_r0.2.json
The training datasets are shared at this link (non-Columbia affiliates will need to request access).
If you are using this code, please cite the following:
@inproceedings{subbiah2023towards,
title={Towards Detecting Harmful Agendas in News Articles},
author={Subbiah, Melanie and Bhattacharjee, Amrita and Hua, Yilun and Kumarage, Tharindu and Liu, Huan and McKeown, Kathleen},
booktitle={Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, \& Social Media Analysis},
pages={110--128},
year={2023}
}