This repo contains the code for our paper "Challenges in Automated Debiasing for Toxic Language Detection". In particular, it contains the code to fine-tune RoBERTa and RoBERTa with the ensemble-based method on the task of toxic language prediction, together with the index of the data points we used in the experiments. Our experiments mainly focus on the dataset from "Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior".
Our implementation lives in the `./src` folder. The `run_toxic.py` file organizes the classifier, and `modeling_roberta_debias.py` builds the ensemble-based model.
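For intuition, here is a minimal sketch of an ensemble-based (product-of-experts style) debiasing loss of the kind such a model can use. It illustrates the general idea only, not the exact code in `modeling_roberta_debias.py`, and the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def poe_loss(main_logits, bias_log_probs, labels):
    """Illustrative product-of-experts debiasing loss.

    main_logits:    [batch, num_classes] logits from the RoBERTa classifier
    bias_log_probs: [batch, num_classes] log-probabilities from a frozen
                    bias-only model (e.g. one trained on shallow cues)
    labels:         [batch] gold labels
    """
    # Ensemble in log space: p_ensemble is proportional to p_main * p_bias.
    combined = F.log_softmax(main_logits, dim=-1) + bias_log_probs
    # Cross-entropy on the combined scores; gradients flow only into the
    # main model, discouraging it from relying on the biased cues.
    return F.cross_entropy(combined, labels)
```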
We require `pytorch>=1.2` and `transformers==2.3.0`. Additional requirements are listed in `requirements.txt`.
- You can find the index of the training data for the different data selection methods in `data/founta/train` (a loading sketch follows this list).
- You can find a complete list of the data entries we need for the experiments in `data/demo.csv`.
- Out-of-distribution (OOD) data: the two OOD datasets we use are publicly available:
    - ONI-adv: the test set of "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack".
    - User-reported: from "User-Level Race and Ethnicity Predictors from Twitter Text".
- Our word list for lexical bias is in `./data/word_based_bias_list.csv`.
- Since we do not encourage building systems on top of our relabeled dataset, we have decided not to release it publicly. For research purposes, please contact the first author for access to the dataset.
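As a hypothetical example of how a released index file might be combined with the full data file, here is a small loading sketch; the file name under `data/founta/train`, the one-row-index-per-line format, and the CSV location are assumptions, not guaranteed by the repo.

```python
import pandas as pd

# Hypothetical example: select a training split from the full data file
# using one of the released index files (file name is illustrative).
demo = pd.read_csv("data/demo.csv")
with open("data/founta/train/random_selection.txt") as f:
    train_idx = [int(line) for line in f if line.strip()]

train_df = demo.iloc[train_idx]
print(f"selected {len(train_df)} training examples")
```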
Run `python ./tools/get_stats.py /location/of/your/data_file.csv` to obtain the Pearson r correlation between toxicity and Tox-Trig words / AAV probabilities.
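A rough sketch of the kind of correlation the script reports, assuming a data file with hypothetical columns `toxicity`, `toxtrig_count`, and `aav_prob` (the real column names used by `get_stats.py` may differ):

```python
import pandas as pd
from scipy.stats import pearsonr

# Illustrative only: column names are assumptions.
df = pd.read_csv("/location/of/your/data_file.csv")

r_lex, p_lex = pearsonr(df["toxicity"], df["toxtrig_count"])
r_dia, p_dia = pearsonr(df["toxicity"], df["aav_prob"])
print(f"toxicity vs. Tox-Trig counts: r={r_lex:.3f} (p={p_lex:.3g})")
print(f"toxicity vs. AAV probability: r={r_dia:.3f} (p={p_dia:.3g})")
```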
Run `sh run_toxic.sh` to fine-tune the vanilla RoBERTa classifier.
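Roughly, this fine-tunes RoBERTa as a binary toxicity classifier. A minimal, self-contained sketch of that kind of fine-tuning loop (using the tuple-style outputs of the pinned `transformers==2.3.0`, with placeholder data, not the script's actual configuration) looks like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Minimal sketch; run_toxic.sh / run_toxic.py handle the real configuration.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

texts = ["example tweet one", "example tweet two"]   # placeholder data
labels = torch.tensor([0, 1])

# Encode and pad to a common length.
enc = [tokenizer.encode(t, add_special_tokens=True, max_length=128) for t in texts]
max_len = max(len(ids) for ids in enc)
input_ids = torch.tensor(
    [ids + [tokenizer.pad_token_id] * (max_len - len(ids)) for ids in enc]
)
attention_mask = (input_ids != tokenizer.pad_token_id).long()

loader = DataLoader(TensorDataset(input_ids, attention_mask, labels), batch_size=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch_ids, batch_mask, batch_labels in loader:
    optimizer.zero_grad()
    # transformers 2.3.0 returns a tuple; index 0 is the loss when labels are given.
    loss = model(batch_ids, attention_mask=batch_mask, labels=batch_labels)[0]
    loss.backward()
    optimizer.step()
```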
Run `sh run_toxic_debias.sh` to fine-tune RoBERTa with the ensemble-based method. You need to obtain the bias-only model first in order to train the ensemble model; feel free to use the files we provide in the `tools` folder.
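As a hypothetical illustration of what a bias-only model can look like (the actual files in `tools` may differ), here is a sketch that trains a logistic classifier on a single lexical feature and saves its per-example log-probabilities for the ensemble; the column names (`word`, `text`, `label`) and output file are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical bias-only model: a logistic classifier over the count of
# Tox-Trig words per tweet. Column names below are assumptions.
df = pd.read_csv("data/demo.csv")
toxtrig = set(pd.read_csv("data/word_based_bias_list.csv")["word"])

def toxtrig_count(text):
    return sum(tok.lower() in toxtrig for tok in str(text).split())

X = np.array([[toxtrig_count(t)] for t in df["text"]])
y = df["label"].values

bias_only = LogisticRegression().fit(X, y)
bias_log_probs = bias_only.predict_log_proba(X)   # per-example log-probabilities
np.save("bias_only_log_probs.npy", bias_log_probs)
```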
You can use the same fine-tuning script to obtain predictions from the models. The bias-measuring script takes the predictions as input and outputs the models' performance and lexical/dialectal bias scores; it is available in the `src` folder.
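One plausible way to compute such scores (an assumption about the metric, not necessarily what the script does) is overall accuracy plus false positive rates on non-toxic examples that contain Tox-Trig words or have a high AAV probability; a sketch with hypothetical column names:

```python
import pandas as pd

# Illustrative bias scoring, assuming a predictions CSV with columns
# "label" (gold), "pred" (model prediction), "toxtrig_count", "aav_prob".
df = pd.read_csv("predictions.csv")

def false_positive_rate(sub):
    # Share of non-toxic examples in the subset that the model flags as toxic.
    nontoxic = sub[sub["label"] == 0]
    return (nontoxic["pred"] == 1).mean()

acc = (df["label"] == df["pred"]).mean()
lexical_bias = false_positive_rate(df[df["toxtrig_count"] > 0])
dialectal_bias = false_positive_rate(df[df["aav_prob"] > 0.5])
print(f"accuracy={acc:.3f}  lexical FPR={lexical_bias:.3f}  dialectal FPR={dialectal_bias:.3f}")
```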