wiki-bias

This repository contains code for the paper Multilingual Sentence-Level Bias Detection in Wikipedia, as well as three datasets for the task of bias detection.

Prerequisites

  • Python 3.6 or later
  • All dependencies, installed with: pip install -r /path/to/requirements.txt

Usage

1. URLs

Extract the URLs of all parts of the complete page edit history dump. Dumps of small Wikipedias (such as the Bulgarian one) come in a single file, while large ones (English, French, etc.) are split into multiple smaller files.

Make sure to select an existing date from the list on https://dumps.wikimedia.org/{lang}wiki/ and then verify the presence of a dump called All pages with complete page edit history for this particular date.

python url_extractor.py -o <outputfile> -l <xx> -d <YYYYMMDD>
# for example
# python url_extractor.py -o urls.txt -l fr -d 20191001
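
As a rough illustration of this step (not the code in url_extractor.py), the sketch below pulls the history-dump file names out of a dump index page that has already been fetched. The function name history_dump_urls is hypothetical, and the sketch assumes that the parts are linked as {lang}wiki-{date}-pages-meta-history*.7z under https://dumps.wikimedia.org/{lang}wiki/{date}/.

```python
import re
from typing import List


def history_dump_urls(index_html: str, lang: str, date: str) -> List[str]:
    """Return absolute URLs of the full-history dump parts linked in index_html.

    Hypothetical helper: it assumes the .7z parts follow the naming scheme
    <lang>wiki-<date>-pages-meta-history*.7z.
    """
    base = f"https://dumps.wikimedia.org/{lang}wiki/{date}/"
    pattern = re.compile(
        r'href="[^"]*?('
        + re.escape(f"{lang}wiki-{date}-pages-meta-history")
        + r'[^"/]*\.7z)"'
    )
    names = []
    for name in pattern.findall(index_html):
        if name not in names:  # keep the first occurrence, preserve page order
            names.append(name)
    return [base + name for name in names]
```

The function takes the page content as a string, so it can be tried on a saved copy of the index page before any download is attempted.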

2. Download and extraction of revision pairs

Download all parts of the dump and extract the relevant revision pairs.

Attention! Downloading highly compressed dump files and processing them on the fly takes a long time. Consider parallelizing this step if you need to process large Wikis split into multiple files.

python filter.py -i <inputfile> -o <outputfile> -l <xx>
# for example
# python filter.py -i urls.txt -o revisions.txt -l fr
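
One possible way to parallelize this step is sketched below: the URL list is split into chunks and one filter.py process is started per chunk. This is a sketch under assumptions, not part of the repository. It assumes that the URL file holds one URL per line, that filter.py can work on any subset of those URLs, and that the per-chunk outputs can be merged afterwards. The helper names and the chunk file names are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def split_urls(urls: List[str], n_chunks: int) -> List[List[str]]:
    """Distribute URLs round-robin over at most n_chunks non-empty lists."""
    chunks = [urls[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def filter_command(chunk_file: str, out_file: str, lang: str) -> List[str]:
    """Build the filter.py call documented above for one chunk."""
    return [sys.executable, "filter.py", "-i", chunk_file, "-o", out_file, "-l", lang]


def run_parallel(url_file: str, lang: str, n_chunks: int = 4) -> List[str]:
    """Write one URL file per chunk and run filter.py on each concurrently."""
    urls = Path(url_file).read_text().split()
    commands, outputs = [], []
    for k, chunk in enumerate(split_urls(urls, n_chunks)):
        chunk_file = f"{url_file}.part{k}"        # hypothetical naming
        out_file = f"revisions.part{k}.txt"       # hypothetical naming
        Path(chunk_file).write_text("\n".join(chunk) + "\n")
        commands.append(filter_command(chunk_file, out_file, lang))
        outputs.append(out_file)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Each thread only waits for its own filter.py subprocess to finish.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return outputs
```

Calling run_parallel("urls.txt", "fr") would then produce one revisions.part<k>.txt file per chunk.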

3. Preprocessing and diff check

Preprocess, segment, and clean up the revisions, then run the diff check and filter the results.

python diff.py -i <inputfile> -o <outputfile> -l <xx>
# for example
# python diff.py -i revisions.txt -o diffs.pickle -l fr
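
To give an idea of what a sentence-level diff check can look like, the sketch below compares the sentence lists of two revisions with Python's difflib and keeps only one-to-one replacements. It is an illustration under assumptions, not the logic of diff.py: the function name and the rule of keeping only equal-length replacements are hypothetical.

```python
import difflib
from typing import List, Tuple


def changed_sentence_pairs(before: List[str], after: List[str]) -> List[Tuple[str, str]]:
    """Return (old, new) pairs for sentences rewritten between two revisions.

    Assumption: only replacements that map the same number of sentences on
    both sides are kept; pure insertions and deletions are ignored.
    """
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(before[i1:i2], after[j1:j2]))
    return pairs
```

Comparing whole sentences rather than characters keeps untouched sentences out of the result, so only the rewritten ones are paired.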

4. Sentence extraction

Extract the sentences, remove duplicates, assign classes, and balance the classes.

python sents.py -i <inputfile> -o <outputfile>
# for example
# python sents.py -i diffs.pickle -o sents.pickle
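
The duplicate cleanup and balancing mentioned above could be done as in the following sketch. It assumes that balancing means downsampling every class to the size of the smallest one, which may differ from what sents.py does. The function names and the class names in the test are hypothetical.

```python
import random
from typing import Dict, List


def dedupe(sentences: List[str]) -> List[str]:
    """Drop repeated sentences while keeping the first occurrence order."""
    seen = set()
    unique = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique


def balance(classes: Dict[str, List[str]], seed: int = 0) -> Dict[str, List[str]]:
    """Downsample every class to the size of the smallest class (assumption)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    size = min(len(items) for items in classes.values())
    return {label: rng.sample(items, size) for label, items in classes.items()}
```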

5. Labeling and splitting

Assign class labels and split the dataset.

python dataset.py -i <inputfile> -l <lang> (-p <prefix>)
# for example
# python dataset.py -i sents.pickle -l fr -p __label__
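
The __label__ prefix in the example is the convention used by fastText for supervised training data, where each line starts with the prefix followed by the class name. The sketch below shows how such lines and a train/validation/test split might be produced. The label names, the 80/10/10 ratios, and the function names are assumptions made for illustration, not values taken from dataset.py.

```python
import random
from typing import List, Sequence, Tuple


def label_lines(sentences: List[str], label: str, prefix: str = "__label__") -> List[str]:
    """Prepend the prefix and class name to every sentence."""
    return [f"{prefix}{label} {sentence}" for sentence in sentences]


def split_dataset(
    lines: List[str],
    ratios: Sequence[float] = (0.8, 0.1, 0.1),  # assumed train/val/test ratios
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle the lines and cut them into train, validation, and test parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```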

Datasets

Balanced and split datasets in Bulgarian, French and English (extracted from the 20190401 dumps) can be found in /datasets/.

Citation

@InProceedings{aleksandrovamultilingual,
  author = "Aleksandrova, Desislava
            and Lareau, Fran{\c{c}}ois
            and M{\'e}nard, Pierre-Andr{\'e}",
  title = "Multilingual Sentence-Level Bias Detection in Wikipedia",
  booktitle = "Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2019)",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  pages = "42--51",
  location = "Varna, Bulgaria"
}