A Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation.

Primary language: Python. License: MIT.

BUG Dataset

A Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation (Levy et al., Findings of EMNLP 2021).

BUG was collected semi-automatically from several real-world corpora and is designed to be challenging in terms of societal gender role assignments for machine translation and coreference resolution.

Setup

  1. Unzip data.tar.gz; this should create a data folder with the following files:
    • balanced_BUG.csv
    • full_BUG.csv
    • gold_BUG.csv
  2. Set up a Python 3.x environment and install the requirements:
pip install -r requirements.txt
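
As a sanity check on step 1, the following is a minimal sketch that extracts the archive and reports which of the three expected files are missing. The helper names `extract_archive` and `missing_data_files` are hypothetical and not part of this repository; the sketch assumes the archive unpacks into a `data` folder as described above and that it comes from a trusted source.

```python
import tarfile
from pathlib import Path

# File names taken from the Setup list above.
EXPECTED_FILES = ("balanced_BUG.csv", "full_BUG.csv", "gold_BUG.csv")


def extract_archive(archive="data.tar.gz", dest="."):
    """Extract a gzip-compressed tar archive into dest (archive assumed trusted)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


def missing_data_files(data_dir="data"):
    """Return the expected BUG files that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

Calling `extract_archive()` followed by `missing_data_files()` from the repository root should return an empty list if the archive unpacked as expected.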

Dataset Partitions

NOTE: These partitions vary slightly from those reported in the paper due to improvements and bug fixes made after submission. For reproducibility's sake, you can access the dataset from the submission here.

Full BUG

105,687 sentences with a human entity, identified by their profession and a gendered pronoun.

Gold BUG

1,717 sentences, the gold-quality human-validated samples.

Balanced BUG

25,504 sentences, randomly sampled from Full BUG to ensure balance between male and female entities and between stereotypical and non-stereotypical gender role assignments.
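
To inspect how a partition is distributed across gender and stereotype labels, one could tally rows per label pair. This is a minimal sketch, not the sampling code used to build Balanced BUG; it assumes each CSV file starts with a header row using the column names listed under Dataset Format (`predicted gender` and `stereotype`), and the function name `balance_counts` is hypothetical.

```python
import csv
from collections import Counter


def balance_counts(csv_path):
    """Count rows per (predicted gender, stereotype) pair in a BUG CSV file.

    Assumes a header row containing the 'predicted gender' and 'stereotype'
    columns. Values are kept as the strings found in the file.
    """
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[(row["predicted gender"], row["stereotype"])] += 1
    return counts
```

For example, `balance_counts("data/balanced_BUG.csv")` returns a `Counter` keyed by pairs such as `("female", "-1")`, which makes any imbalance between groups easy to spot.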

Dataset Format

Each file in the data folder is a csv file adhering to the following format:

| Column | Header | Description |
|---|---|---|
| 1 | sentence_text | Text of a sentence with a human entity, identified by their profession and a gendered pronoun |
| 2 | tokens | List of tokens (using the spaCy tokenizer) |
| 3 | profession | The entity in the sentence |
| 4 | g | The pronoun in the sentence |
| 5 | profession_first_index | Word offset of the profession in the sentence |
| 6 | g_first_index | Word offset of the pronoun in the sentence |
| 7 | predicted gender | 'male'/'female', determined by the pronoun |
| 8 | stereotype | -1/0/1 for anti-stereotypical, neutral, and stereotypical sentences |
| 9 | distance | Absolute distance in words between the pronoun and the profession |
| 10 | num_of_pronouns | Number of pronouns in the sentence |
| 11 | corpus | The corpus from which the sentence is taken |
| 12 | data_index | The query index of the pattern of the sentence |
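
The relationship between the offset columns and `distance` can be checked with a short script. This is a sketch under two assumptions that the description above suggests but does not state outright: that `distance` equals the absolute difference between `profession_first_index` and `g_first_index`, and that these columns hold numeric strings. The function names are hypothetical.

```python
import csv


def check_distance(row):
    """Return True if 'distance' equals |profession_first_index - g_first_index|.

    The equality is an assumption inferred from the column descriptions.
    """
    profession = int(float(row["profession_first_index"]))
    pronoun = int(float(row["g_first_index"]))
    return int(float(row["distance"])) == abs(profession - pronoun)


def inconsistent_rows(csv_path):
    """Return 0-based indices of data rows whose distance does not match."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [i for i, row in enumerate(csv.DictReader(f))
                if not check_distance(row)]
```

If the assumption holds for a given file, `inconsistent_rows("data/gold_BUG.csv")` returns an empty list; a non-empty result would indicate that `distance` is defined differently than assumed here.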

Evaluations

See the instructions below for reproducing our evaluations on BUG.

Coreference

  1. Download the SpanBERT predictions from this link.
  2. Unzip and put coref_preds.jsonl in the predictions/ folder.
  3. From src/evaluations/, run python evaluate_coref.py --in=../../predictions/coref_preds.jsonl --out=../../visualizations/delta_s_by_dist.png.
  4. This should reproduce the coreference evaluation figure.

Conversions

CoNLL

To convert each data partition to CoNLL format run:

python convert_to_conll.py --in=path/to/input/file --out=path/to/output/file

For example, try:

python convert_to_conll.py --in=../../data/gold_BUG.csv --out=./gold_bug.conll

Filter from SPIKE

  1. Download the desired SPIKE CSV files and save them all in the same directory (directory_path).
  2. Make sure the name of each file ends with _<corpus><x>.csv, where corpus is the name of the SPIKE dataset and x is the number of the query you entered in the search (for example, myspikedata_wikipedia18.csv).
  3. From src/evaluations/, run python Analyze.py directory_path.
  4. This should reproduce the full dataset and balanced dataset.
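
Before running the script, the naming convention from step 2 can be checked with a small helper. This is a sketch only and does not reproduce the parsing done in Analyze.py; it assumes corpus names consist of letters only, so that the trailing digits can be read as the query number. The name `parse_spike_name` is hypothetical.

```python
import re
from pathlib import Path

# Matches a suffix such as "_wikipedia18.csv".
# Assumption: the corpus name contains letters only.
SPIKE_NAME = re.compile(r"_(?P<corpus>[A-Za-z]+)(?P<query>\d+)\.csv$")


def parse_spike_name(filename):
    """Return (corpus, query_number), or None if the name does not match."""
    match = SPIKE_NAME.search(Path(filename).name)
    if match is None:
        return None
    return match.group("corpus"), int(match.group("query"))
```

With the example from step 2, `parse_spike_name("myspikedata_wikipedia18.csv")` returns `("wikipedia", 18)`, while a name without the expected suffix returns `None`.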

Citing

@misc{levy2021collecting,
      title={Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation}, 
      author={Shahar Levy and Koren Lazar and Gabriel Stanovsky},
      year={2021},
      eprint={2109.03858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}