/moers-hackday

All you need to know to use Diffix anonymization for the Moers Hackday

moers-hackday

All you need to know to use Diffix anonymization for the Moers Hackday

Data set

The data set for the Moers Hackday consists of several years of traffic fines, both moving and stationary. The basic information for the data set is taken from Moers Open Data (see this page).

Moers removed all personal information from the data set. This makes the dataset private, but results in the loss of some potentially information about the data: How many different cars have been fined? How many fines do different cars have? Where do the cars come from?

Diffix allows us to learn this kind of information while still protecting privacy. To demonstrate this, we have added synthetic data about cars to the Moers data. We assigned cars, along with their license plate numbers according to a probability distribution. Diffix generates statistical information about the data set (including the people) without revealing individual information about the people.

Analyze data without SQL

The Open Diffix offers a software tool, Diffix for Desktop, that allows an analyst to prepare anonymized tables through a simple GUI interface. It is available for Windows, Mac, or Linux.

To get started with Diffix for Desktop:

  1. Download Diffix for Desktop.
  2. Download finalMoers.csv.zip from this GitHub repo and unzip.
  3. After starting Diffix for Desktop, load finalMoers.csv into the application.
  4. Configure the AID to be the column PID

Now you are ready to explore the data set.

Analyze data with SQL

The Open Diffix offers a second tool, Diffix for PostgreSQL that is installed as the PostgreSQL extension pg_diffix. This allows analysts to access data through the standard PostgreSQL interface, thus making it possible to use a variety of standard visualization tools and SQL clients, including psql with Diffix.

Diffix for PostgreSQL offers substantially more query options than Diffix for Desktop. If you know SQL, or would like to learn SQL, then this is the better choice.

Diffix for PostgreSQL allows only a limited subset of SQL, and has some other idiosyncrasies as well. Details can be found in our Analyst Guide. Alternatively, we provide a training app that takes you through step-by-step examples using live queries. It takes about 5 minutes to complete for basic queries, and around 30 minutes for the whole course. (It also serves as a good reference for query templates.)

We offer several options to analyze data with SQL for Hackday:

Data visualization using Metabase

Metabase is a "Business Intelligence" tool that connects to a backend database system and let's you easily visualize data and build dashboards. Click here for an example of a simple dashboard built from the Moers traffic data.

Metabase supports both SQL-based and GUI-based queries. The latter doesn't work well with pg_diffix, but the former does. Click here for details on how to login to our Metabase installation and query the Moers traffic data.

DBeaver SQL client

If you prefer to use a more standard SQL client, DBeaver is a lightweight client that works for pg_diffix. After downloading, configure the Moers traffic database with the following parameters:

And password moers.

psql

psql also works with pg_diffix. You can run it with the same parameters shown for DBeaver above. (If you don't know what psql is, then this isn't a good option for you for Hackday.)

Jupyter Notebook

Finally, if you prefer working from a Jupyter Notebook, you can use this notebook as a template to work from (changing the login parameters as needed).