Fair_Dataset_Distillation
Source code for the SustaiNLP@EMNLP 2022 paper "Towards Fair Dataset Distillation for Text Classification".
@inproceedings{han-etal-2022-towards-fair,
title = "Towards Fair Dataset Distillation for Text Classification",
author = "Han, Xudong and
Shen, Aili and
Li, Yitong and
Frermann, Lea and
Baldwin, Timothy and
Cohn, Trevor",
booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.sustainlp-1.13",
pages = "65--72",
abstract = "With the growing prevalence of large-scale language models, their energy footprint and potential to learn and amplify historical biases are two pressing challenges. Dataset distillation (DD) {---} a method for reducing the dataset size by learning a small number of synthetic samples which encode the information in the original dataset {---} is a method for reducing the cost of model training, however its impact on fairness has not been studied. We investigate how DD impacts on group bias, with experiments over two language classification tasks, concluding that vanilla DD preserves the bias of the dataset. We then show how existing debiasing methods can be combined with DD to produce models that are fair and accurate, at reduced training cost.",
}
Overview
In this work, we first show that dataset distillation preserves the bias of the original dataset. We then propose a framework that combines existing debiasing methods with dataset distillation to produce models that are fair and accurate, at reduced training cost.
Code
This directory contains the source code for reproducing the experimental results in the paper.
- The text distillation implementation is based on https://github.com/ilia10000/dataset-distillation
- The bias mitigation approaches are based on https://github.com/HanXudong/fairlib
Assessing Fairness
Fairness evaluation metrics are included in Fair_Dataset_Distillation/fairness_src/evaluator.
Since additional protected labels are required for fairness evaluation and bias mitigation, we provide example dataloaders in Fair_Dataset_Distillation/fairness_src/dataloaders/.
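To illustrate why protected labels are needed at evaluation time, here is a minimal sketch of one common group-fairness measure: the gap in true positive rate (TPR) between protected groups, aggregated over classes by root mean square. This is an assumption about the kind of metric involved, not a copy of the code in fairness_src/evaluator, and the function names `group_tpr` and `rms_tpr_gap` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def group_tpr(y_true, y_pred, groups):
    """Return {(class, group): TPR} computed over the given instances."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for y, p, g in zip(y_true, y_pred, groups):
        totals[(y, g)] += 1
        hits[(y, g)] += int(y == p)
    return {key: hits[key] / totals[key] for key in totals}


def rms_tpr_gap(y_true, y_pred, groups):
    """Root mean square, over classes, of the largest TPR difference between groups."""
    per_class = defaultdict(list)
    for (y, _group), value in group_tpr(y_true, y_pred, groups).items():
        per_class[y].append(value)
    gaps = [max(values) - min(values) for values in per_class.values()]
    return sqrt(sum(gap ** 2 for gap in gaps) / len(gaps))


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    groups = [0, 0, 1, 1, 0, 0, 1, 1]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
    print(rms_tpr_gap(y_true, y_pred, groups))
```

A value of 0 means every class is recalled equally well for all protected groups; larger values indicate a larger disparity.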
Preprocessing
As in fairlib, pre-processing approaches are implemented in the BaseDataset class, which balances the distributions of target labels and demographic attributes.
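The sketch below shows one possible balancing strategy: downsampling so that every (target label, protected attribute) combination is equally frequent. It is an illustrative assumption rather than the BaseDataset implementation, which may use a different strategy, and `balanced_indices` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def balanced_indices(labels, protected, seed=0):
    """Downsample so each (label, protected) cell keeps the same number of instances."""
    cells = defaultdict(list)
    for index, (y, g) in enumerate(zip(labels, protected)):
        cells[(y, g)].append(index)
    # The smallest cell determines how many instances every cell may keep.
    cell_size = min(len(indices) for indices in cells.values())
    rng = random.Random(seed)
    kept = []
    for indices in cells.values():
        kept.extend(rng.sample(indices, cell_size))
    return sorted(kept)


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    protected = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
    print(balanced_indices(labels, protected))
```

Downsampling discards data, so reweighting instances by the inverse of their cell frequency is a common alternative when the smallest cell is very small.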
In-processing
Adversarial training and fair contrastive learning are implemented in Fair_Dataset_Distillation/fairness_src/networks/.
In-processing methods are included in order to learn fairer synthetic datasets, as can be seen here.
Scripts
- To reproduce the experimental results in the paper, please see the scripts in Fair_Dataset_Distillation\Scripts.
Script names follow this format:
{Dataset}_{Method}_tune_{Number of instances per class}.slurm
Each file contains the command line for running the code with all required hyperparameters, for example:
python main.py --mode distill_basic --dataset Bios --arch MLPClassifier --distill_steps 1 --train_nets_type known_init --n_nets 1 --test_nets_type same_as_train --static_labels 0 --random_init_labels zeros --textdata True --fairness True --distill_epochs 3 --distill_lr 0.01 --decay_epochs 10 --epochs 30 --lr 0.01 --ntoken 5000 --ninp 768 --num_workers 0 --base_seed 88613 --adv_debiasing False --results_dir /experimental_results/Bios_Vanilla_tune_1_0/
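When working with many scripts, it can help to build and parse file names that follow the pattern above. The sketch below is a hypothetical helper, not part of the repository. It assumes that dataset names contain no underscore and that the number of instances per class is a non-negative integer. The example name `Bios_Vanilla_tune_1.slurm` is inferred from the results directory in the command above and may not match an actual file.

```python
import re

# Pattern: {Dataset}_{Method}_tune_{Number of instances per class}.slurm
_SCRIPT_RE = re.compile(r"(?P<dataset>[^_]+)_(?P<method>.+)_tune_(?P<n>\d+)\.slurm")


def script_name(dataset, method, instances_per_class):
    """Build a script file name following the naming pattern."""
    return f"{dataset}_{method}_tune_{instances_per_class}.slurm"


def parse_script_name(name):
    """Split a script file name into (dataset, method, instances_per_class)."""
    match = _SCRIPT_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised script name: {name!r}")
    return match["dataset"], match["method"], int(match["n"])


if __name__ == "__main__":
    name = script_name("Bios", "Vanilla", 1)
    print(name, parse_script_name(name))
```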
Dataset
Please follow the dataset instructions at https://github.com/HanXudong/fairlib.