Suspicion Machine

This project is part of the Surveillance Newsroom at Lighthouse Reports. Other projects from Lighthouse Reports can be found at the main GitHub repo.

Project Overview

This project contains the source code for an investigation into the use of a welfare fraud prediction algorithm by the city of Rotterdam. The model used by Rotterdam is a Gradient Boosting Machine (GBM) trained on a dataset of 12,707 past investigations into welfare fraud. This project contains source code for and results of the experiments conducted to explore and identify potential bias in the GBM model.

Reporting Partners

WIRED
Vers Beton
Follow the Money
Argos

Methods Used

Statistical Parity
Inferential Statistics
Data Wrangling
Data Visualisation

Technologies

R
Jupyter Notebook

Project Description

This project explores the impact of beneficiaries' attributes on their risk scores. The GBM model evaluates the welfare beneficiaries based on 315 features and assigns each applicant a risk score from [0, 1], where higher risk score indicate greater risk of having committed some kind of 'illegality'. Illegality encompasses everything from simple mistakes on a form to serious fraud. This project uses data obtained through press questions and public records requests from Rotterdam Municipality and employs the following approaches:

Testing whether the fraud prediction system violates Statistical Parity and Conditional Statistical Parity
Synthetic data generation for replication, since the original training data cannot be shared publicly due to GDPR concerns
Extraction of the model's decision trees

Repo Overview

Follow setup instructions to reproduce results and create your own experiments
Raw Synthetic Data is kept here
Synthetic Data Generation model is kept here
Experiment results are kept here
The trained model file is kept here
The source code used to train the model is here

Notebooks

20221020_jb_synth-model.ipynb: Synthetic data generation script
data-generator.ipynb: Generates custom data to run experiment on one or more attributes.
analyze.R: Script runs GBM model and tests for violations of statistical parity.
20221208_ted_build_trees.R: Extract and visualize decision tree

Contributing Members

Justin-Casimir Braun | https://github.com/jusbraun | justin-casimir@lighthouseports.com

Htet Aung | https://github.com/NecklessCage | htet@ligthhousereports.com

Gabriel Geiger | https://github.com/gheghi18 | gabriel@lighthousereports.com

Eva Constantares | eva@lighthousereports.com

Lighthouse-Reports/suspicion_machine