FairDo is a Python package for mitigating bias in data. It can be used to create datasets that comply with the Artificial Intelligence Act (AI Act).
- Official repository of: Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes
- Documentation: https://fairdo.readthedocs.io/en/latest/
- Source Code: https://github.com/mkduong-ai/fairdo/tree/main
According to the Artificial Intelligence Act (AI Act), adopted by the European Parliament on March 13, 2024, one of the key requirements for high-risk AI applications is to ensure that they are fair. Recital (67) of the AI Act states:
[...] The data sets should also have the appropriate statistical properties, including as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used, with specific attention to the mitigation of possible biases in the data sets. [...]
The FairDo package is designed to help data scientists and AI developers mitigate bias in their data accordingly.
FairDo works specifically for tabular data (pandas.DataFrame), where the data is pre-processed in
such a way that it becomes fair according to a user-given fairness metric.
The pre-processing approach is fairness-agnostic, enabling the optimization
of different fairness criteria.
The framework is able to deal with non-binary protected attributes
such as nationality, race, and gender that arise in real-world datasets.
Because any of the available fairness metrics can be chosen,
it is possible to prioritize the least fortunate group
(Rawls' A Theory of Justice [2]) or the general utility of all groups
(Utilitarianism).
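The difference between the two perspectives can be illustrated with a small, self-contained sketch. It is not FairDo's implementation; the function names and the toy data are hypothetical. It computes the positive-label rate per group and then aggregates the pairwise gaps either by their maximum (a worst-case view, loosely in the Rawlsian spirit) or by their mean (an average view, loosely utilitarian).

```python
from itertools import combinations


def positive_rates(labels, groups):
    """Return the share of positive labels (1) for each group."""
    totals, positives = {}, {}
    for label, group in zip(labels, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + label
    return {g: positives[g] / totals[g] for g in totals}


def pairwise_gaps(labels, groups):
    """Absolute difference in positive rate for every pair of groups."""
    rates = positive_rates(labels, groups)
    return [abs(rates[a] - rates[b]) for a, b in combinations(sorted(rates), 2)]


def disparity_max(labels, groups):
    """Worst-case aggregation: the largest gap between any two groups."""
    return max(pairwise_gaps(labels, groups), default=0.0)


def disparity_mean(labels, groups):
    """Average aggregation: the mean gap over all pairs of groups."""
    gaps = pairwise_gaps(labels, groups)
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy data with a non-binary protected attribute (three groups).
labels = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "c", "c"]

print(disparity_max(labels, groups))   # 0.75 (gap between groups a and b)
print(disparity_mean(labels, groups))  # 0.5  (mean of 0.75, 0.25 and 0.5)
```

Minimizing the maximum gap concentrates on the most disadvantaged pair of groups, while minimizing the mean can accept one large gap if the remaining gaps are small.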
The pre-processing methods (fairdo.preprocessing.HeuristicWrapper and fairdo.preprocessing.DefaultPreprocessing)
work by removing discriminatory data points.
By doing so, the dataset becomes much more balanced and less biased towards
a particular social group.
We approach this task as a combinatorial optimization problem, which
means selecting a subset of the dataset that minimizes the discrimination score.
Because there are exponentially many possibilities for selecting a subset,
our approach uses genetic algorithms to find a fair subset.
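To make the idea concrete, here is a minimal sketch of subset selection with a genetic algorithm, written in plain Python. It is an illustration under stated assumptions, not the algorithm shipped in FairDo: the function names, the removal penalty, the operators (truncation selection, one-point crossover, bit-flip mutation), and all parameter values are choices made for this example. Each candidate solution is a binary mask over the rows, where 1 means the row is kept.

```python
import random


def disparity(labels, groups, mask):
    """Largest gap in positive-label rate between groups among the kept rows."""
    totals, positives = {}, {}
    for label, group, keep in zip(labels, groups, mask):
        if keep:
            totals[group] = totals.get(group, 0) + 1
            positives[group] = positives.get(group, 0) + label
    if set(totals) != set(groups):
        return 1.0  # a group was removed entirely: treat as maximally unfair
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def fitness(labels, groups, mask, removal_penalty=0.1):
    """Discrimination score plus a small (assumed) penalty for removing rows."""
    removed_share = 1 - sum(mask) / len(mask)
    return disparity(labels, groups, mask) + removal_penalty * removed_share


def genetic_subset(labels, groups, pop_size=40, generations=60,
                   mutation_rate=0.02, seed=0):
    """Search for a binary mask (1 = keep row) with low fitness."""
    rng = random.Random(seed)
    n = len(labels)

    def score(mask):
        return fitness(labels, groups, mask)

    # Start from the full dataset plus random masks.
    population = [[1] * n]
    population += [[rng.randint(0, 1) for _ in range(n)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score)
        elite = population[:pop_size // 2]  # the better half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = elite + children
    return min(population, key=score)


# Toy data: group "a" has a far higher positive rate than group "b".
groups = ["a"] * 10 + ["b"] * 10
labels = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 7

full = [1] * len(labels)
best = genetic_subset(labels, groups)
print("before:", len(labels), disparity(labels, groups, full))
print("after: ", sum(best), disparity(labels, groups, best))
```

Because the full dataset is part of the initial population and the better half always survives, the returned mask never scores worse than keeping every row. The removal penalty is one simple way to discourage discarding more data than necessary.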
🚀 For a quick start, use the DefaultPreprocessing class with the default settings.
An example is given in tutorials/1. Default Preprocessor and below.
✅ For data quality, you may want to keep the original data and only add synthetic data
(see tutorials/, where we use the SDV package to generate synthetic data).
For this, specify .fit_transform(approach='add') for the pre-processors
fairdo.preprocessing.HeuristicWrapper and fairdo.preprocessing.DefaultPreprocessing.
This will only add the fair pre-processed synthetic data to the original data.
💨 When data is limited, we advise employing synthetic data
💼 When anonymity is required, use only synthetic data
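The idea behind adding data instead of removing it can be sketched without any dependencies. The example below is a conceptual illustration only: it uses a simple greedy rule as a stand-in for the package's optimizer, and the function names and toy rows are hypothetical. Rows are (group, label) pairs, and candidate rows play the role of synthetic samples.

```python
def max_rate_gap(rows):
    """Largest gap in positive-label rate between groups."""
    totals, positives = {}, {}
    for group, label in rows:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + label
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def add_fair_rows(original, candidates):
    """Greedily append candidate rows that lower the gap.

    Original rows are never removed; only candidates are appended.
    """
    combined = list(original)
    remaining = list(candidates)
    while remaining:
        current_gap = max_rate_gap(combined)
        # Candidate whose addition yields the smallest gap
        best = min(remaining, key=lambda row: max_rate_gap(combined + [row]))
        if max_rate_gap(combined + [best]) >= current_gap:
            break  # no remaining candidate improves fairness
        combined.append(best)
        remaining.remove(best)
    return combined


# Toy rows: group "a" has a higher positive rate than group "b".
original = [("a", 1), ("a", 1), ("a", 0), ("b", 0), ("b", 0), ("b", 1)]
synthetic = [("b", 1), ("b", 1), ("a", 0), ("a", 1)]

result = add_fair_rows(original, synthetic)
print(len(original), max_rate_gap(original))
print(len(result), max_rate_gap(result))
```

The original rows always form the prefix of the result, so data quality is preserved while the appended rows rebalance the groups.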
In the following example, we use the COMPAS [1] dataset. The protected attribute is race and the label is recidivism. Here, we deploy the default pre-processor, which internally uses a genetic algorithm, to remove discriminatory samples of the given dataset. The default pre-processor prevents removing all individuals of a single group.
# fairdo package
from fairdo.utils.dataset import load_data
from fairdo.preprocessing import DefaultPreprocessing
# fairdo metrics
from fairdo.metrics import statistical_parity_abs_diff_max
# Loading a sample dataset with all required information
# data is a pandas.DataFrame
data, label, protected_attributes = load_data('compas', print_info=False)
# Initialize DefaultPreprocessing object
preprocessor = DefaultPreprocessing(protected_attribute=protected_attributes[0],
                                    label=label)
# Fit and transform the data
data_fair = preprocessor.fit_transform(dataset=data)
# Print no. samples and discrimination before and after
disc_before = statistical_parity_abs_diff_max(data[label],
                                              data[protected_attributes[0]].to_numpy())
disc_after = statistical_parity_abs_diff_max(data_fair[label],
                                             data_fair[protected_attributes[0]].to_numpy())
print(len(data), disc_before)
print(len(data_fair), disc_after)
By running this example, the resulting dataset usually has a statistical disparity score of <1% (max. score between all five races), while the original dataset exhibits 27% statistical disparity.
First, set up a Python environment. We recommend using Miniconda. Then activate the created environment and install our package. A detailed guide follows.
Python (==3.8), numpy, scipy, pandas, sklearn
Download Miniconda here.
# Create a conda virtual environment
conda create -n "venv" python=3.8
# Activate conda environment
conda activate venv
OR
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
# On Windows:
.venv\Scripts\activate
# On macOS and Linux:
source .venv/bin/activate
The package is distributed via PyPI and can be directly installed with:
pip install fairdo
To install the latest (development) version, execute the following commands:
# Clone repo
git clone https://github.com/mkduong-ai/fairdo.git
# Move to repo folder
cd fairdo
# Install from source
python setup.py install
Installing in development mode makes changes to the source code take effect immediately, without the need to reinstall the package. This can be done in the following way:
# Clone repo
git clone https://github.com/mkduong-ai/fairdo.git
# Move to repo folder
cd fairdo
# Development installation
pip install -e .
To use synthetic data generation, install the SDV package by executing the following command:
pip install sdv==1.10.0
We did not include the SDV package as a dependency because it is not required
for the core functionality of the FairDo package.
Using any other synthetic data generation package is also possible.
Still, some examples in the tutorials/ folder require the SDV package.
When using FairDo in your work, cite our paper:
@inproceedings{duong2023framework,
title={Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes},
author={Duong, Manh Khoi and Conrad, Stefan},
booktitle={Data Science and Machine Learning},
publisher={Springer Nature Singapore},
number={CCIS 1943},
series={AusDM: Australasian Conference on Data Science and Machine Learning},
year={2023},
pages={105--120},
isbn={978-981-99-8696-5},
}
[1] Larson, J., Angwin, J., Mattu, S., Kirchner, L.: Machine bias (May 2016), https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[2] Rawls, J.: A Theory of Justice (1971), Belknap Press, ISBN: 978-0-674-00078-0
We credit OpenMoji for the emojis used in our logo.