Explainable Global Fairness Verification

Artifact of the paper "Explainable Global Fairness Verification of Tree-Based Classifiers" (SaTML 2022).

This repository contains the implementation of the synthesiser of sufficient conditions for fairness for decision-tree ensembles proposed by Calzavara et al. in their research paper Explainable Global Fairness Verification of Tree-Based Classifiers. At the moment, this repository contains the code of the synthesiser; the scripts to reproduce the experiments described in the paper will be added soon.

Installation

Download the repository. To use the synthesiser, remember to compile with the flags -Iinclude and -lpthread (the full compile command is given in the Usage section below).

Requirements

The following requirements are necessary to use the synthesiser and the accompanying scripts. For reproducibility reasons, specific versions of the libraries are required.

  • Python3
  • C++14 compiler
  • scikit-learn (version 0.23.2 used for the experiments)
  • pandas (version 1.1.3 used for the experiments)
  • numpy (version 1.19.1 used for the experiments)
  • Boost C++ libraries
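
One possible way to install the pinned Python libraries (assuming pip is available; the Boost libraries can instead be installed through your system's package manager):

pip install scikit-learn==0.23.2 pandas==1.1.3 numpy==1.19.1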

Data-Independent Stability Analysis

The resilience-verification repository at https://github.com/FedericoMarcuzzi/resilience-verification contains the instructions to execute the data-independent stability analysis (DISA) on tree-based classifiers. At the moment, that repository contains the essential code to perform the data-independent stability analysis.

Usage

Apply pre-processing to datasets

The synthesiser requires the dataset to be pre-processed by applying 0-1 normalization to the numerical features and one-hot encoding to the categorical features. If you want to use datasets not provided by this repository, have a look at the pre-processing functions contained in ./data_gen/misc.py and at the datasets folder to understand which files must be generated (the names are self-explanatory).
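
As an illustration, here is a minimal sketch of these two transformations in Python. The file and column names are hypothetical, and this is not the repository's actual pre-processing code (see ./data_gen/misc.py for that):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset and feature names, for illustration only.
df = pd.read_csv("dataset.csv")
categorical = ["workclass", "education"]
numerical = [c for c in df.columns if c not in categorical]

# One-hot encode the categorical features.
df = pd.get_dummies(df, columns=categorical)

# Apply 0-1 normalization to the numerical features.
df[numerical] = MinMaxScaler().fit_transform(df[numerical])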

To pre-process the datasets used in the experimental evaluation of the paper, execute the following command inside the data_gen folder:

python3 preproc_<dataset_name>.py

Generate decision tree ensembles with the supported format

After the dataset pre-processing, execute the following command inside the data_gen folder:

python3 generate_data.py <dataset_abbreviation> 1 <random_seed>

This command splits the pre-processed dataset into a training set and a test set.

Supported dataset abbreviations are "ad" (adult), "gm" (german) and "ht" (health). Add your own abbreviations in generate_data.py if you want to support other datasets.
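
For example, the following command splits the pre-processed adult dataset, using 42 as an (arbitrary) random seed:

python3 generate_data.py ad 1 42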

Then, execute the following command from the data_gen folder to generate a decision-tree ensemble in the supported format:

python3 generate_data.py <dataset_abbreviation> 0 <random_seed> <n_trees> <max_depth>
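
For instance, the following command (again with an arbitrary seed) trains an ensemble of 5 trees with maximum depth 6 on the adult dataset:

python3 generate_data.py ad 0 42 5 6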

Use the synthesiser

Compile the synthesiser using the following command:

g++ -o synthesizer ./src/exec_analyser_synthesiser.cpp -Iinclude -lpthread
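
If your compiler does not target C++14 or later by default, you may need to add the standard flag explicitly (an assumption based on the C++14 requirement above):

g++ -std=c++14 -o synthesizer ./src/exec_analyser_synthesiser.cpp -Iinclude -lpthread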

The program requires these arguments:

./synthesizer <ensemble_filename.json> <columns_filename.json> <categorical_columns_names.json> <numerical_columns_indexes.json> <categorical_columns_indexes.json> <protected_attribute> <n_iter_analyze> <n_threads> <n_threads_filtering> <filename_dump_hyperrects.json> <test_set_filename.json> <n_iter_synthesiser> <output_conditions_filename_base.json> <normalizations_columns_conditions.json>

The specific parameters are:

  • ensemble_filename.json: the JSON file containing the structure of the tree-based classifier.
  • columns_filename.json: the JSON file containing all the names of the features.
  • categorical_columns_names.json: the JSON file containing the names of the categorical features.
  • numerical_columns_indexes.json: the JSON file containing the indexes of the numerical features.
  • categorical_columns_indexes.json: the JSON file containing the indexes of the categorical features.
  • protected_attribute: name of the sensitive attribute.
  • n_iter_analyze > 0: number of iterations performed by the data-independent stability analyzer.
  • n_threads > 1: number of threads used by the data-independent stability analyzer.
  • n_threads_filtering: number of threads used in the filtering step performed by the synthesiser.
  • filename_dump_hyperrects.json: name of the file in which the hyper-rectangles generated by the data-independent stability analysis will be dumped.
  • test_set_filename.json: the JSON file containing the test set.
  • n_iter_synthesiser > 0: number of iterations performed by the synthesiser.
  • output_conditions_filename_base.json: prefix of the names of the files that will contain the generated conditions, separated by their complexity.
  • normalizations_columns_conditions.json: the JSON file containing the minimum, maximum, mean and standard deviation for each feature.

Example

./synthesizer ./adult/models/rf_ad_5_6_7.json ./adult/ad_column_names.json ./adult/ad_categorical_column_names.json ./adult/ad_numerical_binary_column_index.json ./adult/ad_categorical_column_index.json sex_male 40 1 ./res/adult/hypers/hypers_ad_5_6_7_100iterA 4 ./res/adult/fair_conditions/fair_conditions_ad_5_6_7_100iterA_1threadF ./adult/ad_normalization_info.json

Run the experiments presented in the paper

Coming soon...

Credit

If you use this implementation in your work, please add a reference/citation to our paper.

Support

If you have questions about the code and how to use it, feel free to contact us by email at lorenzo.cazzaro@unive.it or federico.marcuzzi@unive.it.