/cse544-project

CSE 544 final project on differential privacy

Primary LanguageJupyter Notebook

CSE 544 Final Project

Differential Privacy in the OMOP CDM

To get your environment set up, follow the instructions below. See the experiments 👨🏻‍🔬 directory for the actual work.

Setup

A lot of the following is based off these instructions, and most of the SQL comes from here.

Prerequisites

PostgreSQL

Make sure you have PostgreSQL installed and running somewhere. I used the following to spin it up in Docker:

docker create volume cse544-project
docker run --name cse544-project -p 5432:5432 -v ${PWD}:/data -v cse544-project:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -d postgres:10.1-alpine

Data

Download the 20% of SynPUF dataset here (~113k patients, 920MB compressed) and extract the CSV files into the data directory:

wget ftp://ftp.ohdsi.org/synpuf/1.0.1/synpuf_1.zip
unzip synpuf_1.zip -d data

Note we are using SynPUF v1.0.1.

Metadata

Download the vocabularies (shortcut shown after manual way).

Manual Way

Following step 3 here:

Register at athena.ohdsi.org. Note that if your password doesn't meet the following requirements, the site will just fail silently:

  • Password must be at least 10 characters in length
  • Password must contain at least 1 uppercase characters
  • Password must contain at least 1 digit characters
  • Password must contain at least 1 non-alphanumeric characters

I tried to use & and it fell over. It only worked when I removed all non-alphanumeric characters 🤦🏻‍️.

Then, go to Download and select only the first 8 vocabularies, excluding #7 (NDFT) and get the ZIP file. Or bypass their "security" and download the ZIP file I already assembled using the link below.

☝🏼 Miss me with that noise

wget -O vocab.zip http://athena.ohdsi.org/api/v1/vocabularies/zip/24c823f7-2e79-4a51-abd6-315c945aa098

Prepare Vocabularies

Whichever way you got the ZIP file, extract it into vocab directory:

unzip vocab.zip -d vocab

We also have to manually merge the CPT codes into CONCEPT.csv:

cd vocab
java -jar cpt4.jar 5

Loading data

For simplicty, we'll do everything in a new database called cdm, and we'll use the public schema to avoid having to prefix every table with the schema name every time. We will also use the postgres (admin) user so we don't have to grant permissions.

Create Schema

Drop into the Postgres shell:

psql -U postgres -h localhost

Create the database:

create database cdm;

Connect to the new database:

\c cdm;

Create the tables:

\i /data/sql/create_CDMv5_tables.sql

💡 You will likely have to change the path to the SQL file.

Load Vocabularies

Edit the load_CDMv5_vocabulary.sql file so that it contains the correct abolute paths to your CSV files, then:

\i /data/sql/load_CDMv5_vocabulary.sql

This will take a few minutes.

💡 You will likely have to change the path to the SQL file.

Load Data

Edit the load_CDMv5_synpuf.sql file so that it contains the correct abolute paths to your CSV files, then:

\i /data/sql/load_CDMv5_synpuf.sql

This will take a few minutes.

💡 You will likely have to change the path to the SQL file.

Create Constraints & Indices

Run the following two scripts to create the contraints and indices:

\i /data/sql/create_CDMv5_constraints.sql

and

\i /data/sql/create_CDMv5_indices.sql

This takes a while, and there are some errors that I'm ignoring for now.

💡 You will likely have to change the path to the SQL files.

ERA Tables

There is a section here about creating more tables, but that probably isn't necessary for now.