/GerryFair

Package implementing methods developed in "Preventing Fairness Gerrymandering" [ICML '18], "Rich Subgroup Fairness for Machine Learning" [ FAT* '19]. active development fork @algowatchupenn

Primary LanguagePythonMIT LicenseMIT

Rich Subgroup Fairness

This repository contains python code for both

Prerequisites

python packages: pandas, numpy, sklearn, matplotlib

Cleaning the data

To test on a custom dataset, two files are needed: a file for the dataset itself and a file listing the types of attributes in the dataset. The dataset itself only needs the label column to have values in 0,1. Our cleaning will automatically one-hot code the categorical variables and, if desired, center the data. For the attributes, each column should have a corresponding label, 0 (unprotected attribute), 1 (protected attribute), or 2 (label). See communities_protected_formatted.csv for an example.

Then, to clean the dataset, use clean.py. The usage can be found by typing:

Fairness Data Cleaning

optional arguments:
  -h, --help            show this help message and exit
  -n NAME, --name NAME  name of the to store the new datasets (Required)
  -d DATASET, --dataset DATASET
                        name of the original dataset file (Required)
  -a ATTRIBUTES, --attributes ATTRIBUTES
                        name of the file representing which attributes are
                        protected (unprotected = 0, protected = 1, label = 2)
                        (Required)
  -c, --centered        Include this flag to determine whether data should be
                        centered

An example of the usage would be

python clean.py -n communities -d dataset/communities_formatted.csv -a dataset/communities_protected_formatted.csv -c

Running the tests

To learn a fair classifier on a dataset in the dataset folder subject to gamma unfairness, use Reg_Oracle_Fict.py. The usage can be found by typing:

python Reg_Oracle_Fict.py -h

    usage: Reg_Oracle_Fict.py [-h] [-C C] [-p] [--heatmap]
                          [--heatmap_iters HEATMAP_ITERS] [-d DATASET]
                          [-a ATTRIBUTES] [-i ITERS] [-g GAMMA_UNFAIR]
                          [--plots]

    Reg_Oracle_Fict input parser
    
    optional arguments:
      -h, --help            show this help message and exit
      -C C                  C is the bound on the maxL1 norm of the dual
                            variables, (Default = 10)
      -p, --print_output    Include this flag to determine whether output is
                            printed, (Default = False)
      --heatmap             Include this flag to determine whether heatmaps are
                            generated, (Default = False)
      --heatmap_iters HEATMAP_ITERS
                            number of iterations heatmap data is saved after,
                            (Default = 1)
      -d DATASET, --dataset DATASET
                            name of the dataset that was input into clean (Required)
      -i ITERS, --iters ITERS
                            number of iterations to terminate after, (Default =
                            10)
      -g GAMMA_UNFAIR, --gamma_unfair GAMMA_UNFAIR
                            approximate gamma disparity allowed in subgroups,
                            (Default = .01)
      --plots               Include this flag to determine whether plots of error
                            and unfairness are generated, (Default = False)

An example of this usage, following the command for clean above, is:

python Reg_Oracle_Fict.py -C 10 -p --heatmap --heatmap_iters 1 -d communities -i 10 -g .01

Again, the arguments are:

  • -C: bound on the max L1 norm of the dual variables with a default value of 10
  • --print_output, -p: flag True or False determines whether output is printed with a default value of False
  • --heatmap: flag True or False determines whether heatmaps are generated with a default value of False
  • --heatmap_iters: number of iterations heatmap data is saved after with a default value of 1
  • --dataset, -d: name of the dataset, this is required.
  • --iters, -i: number of iterations to terminate after with a default value of 10
  • --gamma_unfair, -g: approximate gamma disparity allowed in subgroups with a default value of .01
  • --plots: flag True or False determines whether plots are generated with a default value of False

outputs (if --print_output is included), at each iteration print:

  • ave_error: the error of the current mixture of classifiers found by the Learner)
  • gamma-unfairness: the gamma disparity witnessed by the subgroup found at the current round by the Auditor
  • group_size: the size of the above group conditioned on y = 0
  • frac included ppl: the fraction of the dataset that has been included in a group found by the Auditor thus far (on y =0)
  • coefficients of g_t: the coefficients of the hyperplane that defines the group found by the Auditor
  • if --heatmap is included.

To audit for gamma unfairness on a dataset, use Audit.py. The usage can be found by typing:

python Audit.py -h

    usage: Audit.py [-h] [-d DATASET] [-a ATTRIBUTES] [-i ITERS]

    Audit.py input parser

    optional arguments:
      -h, --help            show this help message and exit
      -d DATASET, --dataset DATASET
                            name of the dataset (communities, lawschool, adult,
                            student, all), (Required)
      -a ATTRIBUTES, --attributes ATTRIBUTES
                            name of the file representing which attributes are
                            protected (unprotected = 0, protected = 1, label = 2)
                            (Required)
      -i ITERS, --iters ITERS
                            number of iterations to terminate after, (Default =
                            10)
  • audits trained logistic regression, SVM, nearest-neighbor model.

Datasets

License

  • Maintained by: Seth Neel (sethneel@wharton.upenn.edu), Will Brown, Adel Boyarsky, Arnab Sarker, Aaron Hallac.
  • Property of: Michael Kearns, Seth Neel, Aaron Roth, Z. Steven Wu.