Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark

The official repository for the paper "Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark" （KDD'24 ADS）.

Installation

If you want to recreate the original environment used for the paper:

run the installation script (a clearer version will come soon)

conda env create -f environment.yml

Otherwise, for a new env:

we use: Python=3.7.9, dgl-cu102==0.4.3, torch==1.6.0

Datasets

In this paper, we develop and introduce a collection of synthetic, semi-synthetic, and real-world datasets. You can find these datasets in the dataset folder.

Synthetic dataset

Based on the analysis framework in this paper, you can adjust the bias level in synthetic data by setting parameters in synthetic_config.yaml. Also, you can save or load the existing synthetic datasets by the code in load_data.py.

Semi-synthetic dataset

Through the functions add_edges and remove_edges in utils.py, we obtain three new semi-synthetic datasets named germanA, creditA, and bailA. Following the analysis framework, You can modify other datasets to achieve the desired bias level.

Real-world dataset

Our real-world datasets both originate the social data from Twitter.

Because the size is limited, download them from Google Drive.

We provide some statistics of our datasets in the table below:

Dataset	Syn-1	Syn-2	New German	New Bail	New Credit	Sport	Occupation
# of nodes	5,000	5,000	1,000	18,876	30,000	3,508	6,951
# of edges	34,363	44,949	20,242	31,5870	1,121,858	136,427	44,166
# of features	48	48	27	18	13	768	768
Sensitive attribute	0/1	0/1	Gender (Male/Female)	Race (Black/White)	Age ($<$25/$>$25)	Race (White/Black)	Gender (Male/Female)
Label	0/1	0/1	Good/bad Credit	Bail/no bail	Payment default/no default	NBA/MLB	Psy/CS
Average degree	13.75	17.98	41.48	34.47	75.79	78.78	13.71

More details on our datasets can be found in the paper.

Running the experiments

To reproduce the experiments, the main scripts running the experiments are in the script folder. For example, you can train GCN among all datasets by typing:

bash ./script/gcn.sh

Certainly, You can change the parameter search space or modify some commands to implement multi-threaded training.

Citation

Please cite our paper if you found our datasets or code helpful.

@misc{qian2024addressing,
      title={Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark}, 
      author={Xiaowei Qian and Zhimeng Guo and Jialiang Li and Haitao Mao and Bingheng Li and Suhang Wang and Yao Ma},
      year={2024},
      eprint={2403.06017},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

XweiQ/Benchmark-GraphFairness