The Wikidata Vandalism Detectors FAIR-E and FAIR-S are machine learning models for automatic vandalism detection in Wikidata without discriminating against anonymous editors. They were developed as a joint project between Paderborn University and Leipzig University.
This is the classification and evaluation component for FAIR-E, FAIR-S and the baselines WDVD, ORES, and FILTER. The feature extraction can be done with the corresponding feature extraction component.
This source code forms the basis for our WWW 2019 paper Debiasing Vandalism Detection Models at Wikidata. When using the code, please cite the paper as follows:
@inproceedings{heindorf2019debiasing,
author = {Stefan Heindorf and
Yan Scholten and
Gregor Engels and
Martin Potthast},
title = {Debiasing Vandalism Detection Models at Wikidata},
booktitle = {{WWW}},
publisher = {{ACM}},
year = {2019}
}
The code was tested with 64-bit Python 3.5.2 on Windows 10, on a machine with 16 cores and 256 GB RAM.
We recommend Miniconda for easy installation on many platforms.
- Create new environment:
conda create --name www19-fair python=3.5.2 --file requirements.txt
- Activate environment:
activate www19-fair
- Install Kernel:
python -m ipykernel install --user --name www19-fair --display-name www19-fair
- Start Jupyter:
jupyter notebook
Run the Jupyter notebooks in this order:
01-dataset-analysis.ipynb
02-truth-biases.ipynb
03-baselines.ipynb
04-FAIR-E.ipynb
05-FAIR-S.ipynb
06-evaluation.ipynb
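If you prefer to run the notebooks non-interactively, the order above can be scripted. The following is a minimal sketch and not part of this repository. It assumes that jupyter nbconvert is available in the activated environment, that the kernel was installed under the name www19-fair as described above, and that the script is started from the folder containing the notebooks. The helper names build_commands and run_all are hypothetical.

```python
import subprocess

# Notebook order as listed in this README.
NOTEBOOKS = [
    "01-dataset-analysis.ipynb",
    "02-truth-biases.ipynb",
    "03-baselines.ipynb",
    "04-FAIR-E.ipynb",
    "05-FAIR-S.ipynb",
    "06-evaluation.ipynb",
]


def build_commands(notebooks=NOTEBOOKS):
    """Return one `jupyter nbconvert` command per notebook, in order."""
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute",
            "--inplace",
            # Disable the per-cell timeout, since some cells may run for a long time.
            "--ExecutePreprocessor.timeout=-1",
            # Assumption: kernel name chosen in the installation step above.
            "--ExecutePreprocessor.kernel_name=www19-fair",
            name,
        ]
        for name in notebooks
    ]


def run_all(notebooks=NOTEBOOKS):
    """Execute the notebooks one after another and stop at the first failure."""
    for command in build_commands(notebooks):
        subprocess.run(command, check=True)
```

Calling run_all() executes all six notebooks in sequence and overwrites each notebook file with its executed version. Later notebooks are skipped if an earlier one fails.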
We assume the following project structure:
www19-fair/
├── data/
│ ├── classification/
│ ├── corpus-validity/
│ ├── external/
│ │ └── wdvc-2016/
│ ├── features/
│ │ ├── test/
│ │ │ ├── embeddings/
│ │ │ └── features.csv.bz2
│ │ ├── training/
│ │ │ ├── embeddings/
│ │ │ └── features.csv.bz2
│ │ ├── validation/
│ │ │ ├── embeddings/
│ │ │ └── features.csv.bz2
│ │ └── wdvd_features.csv.bz2
│ ├── item-properties/
│ └── property-domains/
└── www19-fair-classification/
classification: This folder will contain the output of the classification component: plots, tables, and vandalism scores. Initially, it can be empty.
corpus-validity: Manually reviewed Wikidata revisions. You can download the folder corpus-validity.
external: Contains the Wikidata Vandalism Corpus 2016.
features: Contains the features for our models. The feature extraction can be done with the feature extraction component. Alternatively, you can download the features directly.
item-properties: The list of Wikidata item properties at the end of the training set. The file can be created with the feature extraction component. Alternatively, you can download the item-properties directly.
property-domains: The domain each Wikidata property belongs to. You can download the folder property-domains.
www19-fair-classification: This git repository.
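Before running the notebooks, it can help to verify that the layout above is in place. The following is a minimal sketch under the assumption that the folder names match the tree shown above exactly. The list EXPECTED_PATHS and the helper missing_paths are hypothetical and not part of this repository.

```python
from pathlib import Path

# Paths relative to the project root (www19-fair/), taken from the tree above.
EXPECTED_PATHS = [
    "data/classification",
    "data/corpus-validity",
    "data/external/wdvc-2016",
    "data/features/test/embeddings",
    "data/features/test/features.csv.bz2",
    "data/features/training/embeddings",
    "data/features/training/features.csv.bz2",
    "data/features/validation/embeddings",
    "data/features/validation/features.csv.bz2",
    "data/features/wdvd_features.csv.bz2",
    "data/item-properties",
    "data/property-domains",
    "www19-fair-classification",
]


def missing_paths(project_root):
    """Return the expected paths that do not exist below project_root."""
    root = Path(project_root)
    return [path for path in EXPECTED_PATHS if not (root / path).exists()]
```

For example, when called from inside the repository folder, missing_paths("..") returns an empty list if everything is in place, and otherwise the list of folders and files that still need to be created or downloaded.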
The dataset contains some revisions that change the references of subject-predicate-object triples rather than the triples themselves. To filter out these reference changes, change the condition df['revisionAction'].isin(revisionActions) in the notebook 01-dataset-analysis.ipynb to (df['revisionAction'].isin(revisionActions) & df['param4'].isna()). This change has little effect on our evaluation results. For consistency with the paper, this repository uses the original version.
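The effect of the two conditions can be illustrated on a toy DataFrame. This is only a sketch: the column names revisionAction and param4 come from the condition above, while the revisionId column, all values, and the contents of revisionActions are made up for illustration. It also assumes, as the condition suggests, that a non-missing param4 marks a revision that changes a reference.

```python
import numpy as np
import pandas as pd

# Toy data: every value below is made up for illustration.
df = pd.DataFrame({
    "revisionId": [1, 2, 3, 4],
    "revisionAction": ["actionA", "actionA", "actionB", "actionC"],
    "param4": [np.nan, "some-reference", np.nan, np.nan],
})
revisionActions = ["actionA", "actionB"]  # hypothetical selection of actions

# Original condition (used in this repository, consistent with the paper).
original = df[df["revisionAction"].isin(revisionActions)]

# Modified condition: additionally drop rows where param4 is set.
filtered = df[df["revisionAction"].isin(revisionActions) & df["param4"].isna()]

print(list(original["revisionId"]))  # [1, 2, 3]
print(list(filtered["revisionId"]))  # [1, 3]
```

Note the parentheses around the combined condition in the notebook: the & operator binds more tightly than comparison operators, so the parentheses keep the expression safe if it is combined with further conditions.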
For questions and feedback please contact:
Stefan Heindorf, Paderborn University
Yan Scholten, Paderborn University
Gregor Engels, Paderborn University
Martin Potthast, Leipzig University
The code by Stefan Heindorf, Yan Scholten, Gregor Engels, and Martin Potthast is licensed under the MIT license.