The official repository for the paper "Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark" (KDD'24 ADS).
Run the installation script (a clearer version will come soon):
conda env create -f environment.yml
We use Python 3.7.9, `dgl-cu102==0.4.3`, and `torch==1.6.0`.
In this paper, we develop and introduce a collection of synthetic, semi-synthetic, and real-world datasets. You can find these datasets in the `dataset` folder.
Based on the analysis framework in this paper, you can adjust the bias level of the synthetic data by setting the parameters in `synthetic_config.yaml`. You can also save or load existing synthetic datasets with the code in `load_data.py`.
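To make the idea of an adjustable bias level concrete, here is a minimal sketch of one way structural bias can be dialed up or down in a synthetic graph: edges between nodes that share a sensitive attribute are sampled with probability `p_intra`, and edges across groups with `p_inter`. The function and parameter names are illustrative assumptions. They are not the keys in `synthetic_config.yaml`, and this is not the generator used in the repository.

```python
import random


def generate_biased_graph(num_nodes, p_intra, p_inter, seed=0):
    """Sample binary sensitive attributes and an undirected edge list.

    Hypothetical helper: same-group pairs are linked with probability
    p_intra, cross-group pairs with probability p_inter.
    """
    rng = random.Random(seed)
    sens = [rng.randint(0, 1) for _ in range(num_nodes)]
    edges = []
    for u in range(num_nodes):
        for v in range(u + 1, num_nodes):
            p = p_intra if sens[u] == sens[v] else p_inter
            if rng.random() < p:
                edges.append((u, v))
    return sens, edges


def intra_group_ratio(sens, edges):
    """Fraction of edges whose endpoints share the sensitive attribute."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if sens[u] == sens[v])
    return same / len(edges)


# A larger gap between p_intra and p_inter should give a more segregated graph.
for p_inter in (0.10, 0.05, 0.01):
    sens, edges = generate_biased_graph(200, p_intra=0.10, p_inter=p_inter)
    print(p_inter, len(edges), round(intra_group_ratio(sens, edges), 3))
```

In this sketch, the gap between the two probabilities plays the role of the bias knob: equal values give a graph whose structure is independent of the sensitive attribute, while `p_inter = 0` gives fully separated groups.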
Using the functions `add_edges` and `remove_edges` in `utils.py`, we obtain three new semi-synthetic datasets named `germanA`, `creditA`, and `bailA`. Following the analysis framework, you can modify other datasets to achieve the desired bias level.
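As an illustration of how editing edges can shift the bias level of an existing graph, the sketch below adds edges inside sensitive groups and removes edges across them. These helpers are hypothetical stand-ins written for this example; the actual signatures and selection rules of `add_edges` and `remove_edges` in `utils.py` may differ. The pair enumeration is quadratic in the number of nodes, so it only suits small graphs.

```python
import random


def add_intra_group_edges(edges, sens, num_new, seed=0):
    """Return a sorted edge list with up to num_new extra same-group edges.

    Hypothetical helper, not the repository's add_edges.
    """
    rng = random.Random(seed)
    existing = {tuple(sorted(e)) for e in edges}
    n = len(sens)
    candidates = [
        (u, v)
        for u in range(n)
        for v in range(u + 1, n)
        if sens[u] == sens[v] and (u, v) not in existing
    ]
    rng.shuffle(candidates)
    return sorted(existing | set(candidates[:num_new]))


def remove_inter_group_edges(edges, sens, num_remove, seed=0):
    """Return a sorted edge list with up to num_remove cross-group edges dropped.

    Hypothetical helper, not the repository's remove_edges.
    """
    rng = random.Random(seed)
    existing = sorted({tuple(sorted(e)) for e in edges})
    inter = [e for e in existing if sens[e[0]] != sens[e[1]]]
    rng.shuffle(inter)
    dropped = set(inter[:num_remove])
    return [e for e in existing if e not in dropped]


# Toy graph: nodes 0-2 belong to group 0, nodes 3-5 to group 1.
sens = [0, 0, 0, 1, 1, 1]
edges = [(0, 3), (1, 4), (2, 5), (0, 1)]
denser = add_intra_group_edges(edges, sens, num_new=2)
sparser = remove_inter_group_edges(edges, sens, num_remove=2)
print(len(edges), len(denser), len(sparser))
```

Both operations push the graph toward more same-group connectivity, which is the kind of change the semi-synthetic variants rely on.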
Both of our real-world datasets are built from Twitter social data.
Because of file size limits, they are not stored in this repository; download them from Google Drive.
We provide some statistics of our datasets in the table below:
Dataset | Syn-1 | Syn-2 | New German | New Bail | New Credit | Sport | Occupation |
---|---|---|---|---|---|---|---|
# of nodes | 5,000 | 5,000 | 1,000 | 18,876 | 30,000 | 3,508 | 6,951 |
# of edges | 34,363 | 44,949 | 20,242 | 315,870 | 1,121,858 | 136,427 | 44,166 |
# of features | 48 | 48 | 27 | 18 | 13 | 768 | 768 |
Sensitive attribute | 0/1 | 0/1 | Gender (Male/Female) | Race (Black/White) | Age (<25/>25) | Race (White/Black) | Gender (Male/Female) |
Label | 0/1 | 0/1 | Good/bad Credit | Bail/no bail | Payment default/no default | NBA/MLB | Psy/CS |
Average degree | 13.75 | 17.98 | 41.48 | 34.47 | 75.79 | 78.78 | 13.71 |
More details on our datasets can be found in the paper.
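As a quick sanity check on the table, the average degree of an undirected graph is usually computed as 2E/N. The sketch below applies that formula to the node and edge counts listed above. It reproduces the reported values for Syn-1 and Syn-2; for the other five datasets it comes out exactly 1.00 lower than the table, which would be consistent with a different counting convention (for example, one self-loop per node), though that explanation is a guess and is not stated here.

```python
# (number of nodes, number of edges), copied from the statistics table.
STATS = {
    "Syn-1": (5000, 34363),
    "Syn-2": (5000, 44949),
    "New German": (1000, 20242),
    "New Bail": (18876, 315870),
    "New Credit": (30000, 1121858),
    "Sport": (3508, 136427),
    "Occupation": (6951, 44166),
}


def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * num_edges / num_nodes


for name, (n, e) in STATS.items():
    print(f"{name}: {average_degree(n, e):.2f}")
```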
The main scripts for reproducing the experiments are in the `script` folder. For example, you can train GCN on all datasets by running:
bash ./script/gcn.sh
You can also change the parameter search space or modify the commands to run training in multiple threads.
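One possible way to combine a custom search space with parallel runs is sketched below: a parameter grid is expanded into one command per combination, and the commands are dispatched from a small thread pool. The script name `train.py` and the flags `--lr` and `--hidden` are placeholders invented for this example; substitute the actual commands found in the shell scripts under `script`.

```python
import itertools
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_commands(script, grid):
    """Expand a dict of value lists into one command per combination."""
    keys = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[k] for k in keys)):
        cmd = [sys.executable, script]
        for key, value in zip(keys, values):
            cmd += [f"--{key}", str(value)]
        commands.append(cmd)
    return commands


def run_all(commands, max_workers=2):
    """Run the commands concurrently; return exit codes in input order."""
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Hypothetical search space and script name, for illustration only.
grid = {"lr": [0.01, 0.001], "hidden": [16, 64]}
commands = build_commands("train.py", grid)
for cmd in commands:
    print(" ".join(cmd))
# exit_codes = run_all(commands, max_workers=2)  # uncomment with a real script
```

Threads are enough here because each worker only waits on a child process. When several runs share one GPU, keep `max_workers` small so they do not exhaust its memory.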
Please cite our paper if you find our datasets or code helpful.
@misc{qian2024addressing,
title={Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark},
author={Xiaowei Qian and Zhimeng Guo and Jialiang Li and Haitao Mao and Bingheng Li and Suhang Wang and Yao Ma},
year={2024},
eprint={2403.06017},
archivePrefix={arXiv},
primaryClass={cs.LG}
}