Welcome to the PrivacyHistos repository. This repo contains five differential privacy functions for combining features into privatized histograms. These functions can be used in Python applications to produce a synthetic dataset based on a 'ground truth' dataset, with the aim of adhering to differential privacy, as demonstrated by the examples. The five functions are contained in the privacy.py file located in the src directory.
- Basic Formatted Ground Truth
- Basic Sensitivity and Epsilon
- Basic Combination Dictionary
- Basic Numeric Dictionary
- 1 check_input
- 2 preprocess
- 3 histo_test
- 4 create_private_histo
- 5 col_decoder
- Convert Raw Data to Formatted Ground Truth
- Configure and Create the Histograms
- Create Individual Rows of Simulated Data
- Results and Discussion
There are two environments included in this repo; both require Python 3.8.5+:
- Basic - The requirements1.txt file contains the bare-bones packages necessary to run privacy.py and the Basic example.
- Advanced - The requirements2.txt file contains the packages necessary for running all the examples and the unit tests.
The data directory holds the necessary data files but only contains the files for the Basic example. To run the other examples, the data files below must be downloaded from Kaggle (a free Kaggle account is required) into the data directory with the following names:
- ground_truth_2.csv - The ground truth file for the Sprint 2 example (about 111.9MB). download
- ground_truth_3.csv - The ground truth file for the Sprint 3 example (about 623.5MB). download
To run the 11 unit tests simply navigate to the test directory and run pytest.
In order to use the functions the following input is required:
- The ground_truth data file
- All data must be encoded as positive integers with no NaNs
- The column names must be appended as follows:
- _c – indicates a categorical feature
- _n – indicates a numeric feature
- _i – indicates the individual identifier (one, and only one, column must have this suffix)
- _x – indicates a feature to include but not used as a combination feature
- Sensitivity = (number of histograms * sample size) + population queries
- Number of histograms – the total number of histograms which may include other histograms not using the combined column approach.
- Sample size – the number of individual records to use for building the histograms (if sampling is used).
- Population queries – the total number of population queries used in the overall approach.
- Epsilon - The epsilon value to use (typically in the range of 0.1 to 10)
- The combo_dict with the key = combo name and the value = list of columns
- Format - {combo_name: column_list}
- combo_name – name of the new combined feature
- column_list – a list of columns to combine (categorical and/or numeric)
- Example - {‘cp’: [‘company_c’, ‘payment_c’]}
- Explanation – Create a column named ‘cp’ combining categorical fields ‘company_c’ and ‘payment_c’
- Requirements - Categorical columns cannot be combined with numeric columns
- The num_dict with the key = column name and the value = a list of top value, increment, and maximum range
- Format - {column_name: [top_value, increment, max_range]}
- column_name – the name of the numeric feature
- top_value – the top value to use with the increment
- increment – the amount in each bin
- max_range – the maximum value of the numeric range
- Example – {‘fare_n’: [50, 5, 100]}
- Explanation – Create 11 numeric bins for numeric feature ‘fare_n’: bins of 5 from 0-50 (10 bins of 5), with values over 50 falling in a single 50-100 bin (1 bin of 50). See the sketch after this list.
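To make the binning concrete, here is a minimal sketch (not the library code) of the bins implied by an entry such as {'fare_n': [50, 5, 100]}:

# Sketch of the bins implied by the num_dict entry {'fare_n': [50, 5, 100]}
top_value, increment, max_range = 50, 5, 100

# 10 bins of width 5 covering 0-50, plus one catch-all bin covering 50-100
edges = list(range(0, top_value + increment, increment)) + [max_range]
bins = list(zip(edges[:-1], edges[1:]))
print(len(bins))  # 11
print(bins)       # [(0, 5), (5, 10), ..., (45, 50), (50, 100)]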
There are five public functions located in privacy.py:
Tests the formatted ground truth data, the combination dictionary, and the numeric dictionary for proper formatting and values.
- check_input(ground_truth, combo_dict, num_dict)
- ground_truth – the formatted ground truth dataframe
- combo_dict – the combination dictionary
- num_dict – the numeric dictionary
- returns 1 if True
Preprocesses the ground truth data by combining the columns per the combination dictionary and the numeric dictionary.
- preprocess(ground_truth, combo_dict, num_dict)
- ground_truth – the formatted ground truth dataframe
- combo_dict – the combination dictionary
- num_dict – the numeric dictionary
- returns a new dataframe with combined columns (df)
- returns a numeric decoder dictionary (num_decode)
- returns a column decoder dictionary (col_decode)
Counts the number of bins created for each combined column in the pre-processed dataframe. A useful tool for determining the optimal feature combinations.
- histo_test(df, combo_dict)
- df – the pre-processed dataframe
- combo_dict – the combination dictionary
- prints a count of bins for each combined column
Creates privatized histograms of the specified combined column using a specified sample size, sensitivity and epsilon.
- create_private_histo(df, column, sample, sample_size, sensitivity, epsilon)
- df – the pre-processed dataframe
- column – the combined column
- sample – enter 1 to use sampling, 0 otherwise
- sample_size – the size of the sample
- sensitivity – the sensitivity
- epsilon – the epsilon
- returns population (bins) as a list
- returns privatized weights as a list
Decodes the specified column into its privatized value.
- col_decoder(num_dict, num_decode, col_decode, value, column)
- num_dict – the numeric dictionary
- num_decode – the numeric decoder dictionary
- col_decode – the column decoder dictionary
- value – a value to decode
- column – the column to decode
- returns decoded privatized value
The Basic example is located in the examples directory in the basic directory. The python file is main.py. It uses a small sample of data to demonstrate the use of the five functions.
The header and first four data rows of ground_truth_test.csv are shown below:
taxi_id_i,shift_c,company_c,pca_c,dca_c,payment_c,fare_n,tips_n,seconds_n,miles_n
1004636,0,27,76,0,1,44,14,1440,0
1004636,0,27,8,8,0,10,0,720,0
1004636,0,27,8,8,0,5,0,240,0
1004636,0,27,8,6,1,10,2,660,0
- taxi_id_i = a unique taxi driver id
- shift_c = a taxi shift number from 0-20 (categorical)
- company_c = a company id from 0-99 (categorical)
- pca_c = a pickup community area from 0-77 (categorical)
- dca_c = a dropoff community area from 0-77 (categorical)
- payment_c = payment type from 0-9 (categorical)
- fare_n = fare paid (numeric)
- tips_n = tips paid (numeric)
- seconds_n = trip seconds (numeric)
- miles_n = trip miles (numeric)
Notice that the data is properly formatted: all values are positive integers, there are no NaNs, and every column name carries the required suffix.
The Basic sensitivity and epsilon are calculated in main.py:
number_histos = 3 # Create 3 histograms
sample = 1 # Use sampling
sample_size = 3 # Sample size equals 3
population_queries = 2 # Use 2 population queries
epsilon = 10.0 # Use an epsilon value of 10
# Calculate the sensitivity
sensitivity = (number_histos * sample_size) + population_queries
In this example the sensitivity is (3 x 3) + 2 = 11 and the epsilon is 10.
The Basic combination dictionary in main.py is:
# Define the combined columns for the 3 histograms
combo_dict = {'spd': ['shift_c', 'pca_c', 'dca_c'],
'cp': ['company_c', 'payment_c'],
'fare': ['fare_n', 'tips_n', 'seconds_n', 'miles_n']
}
For example, the shift_c, pca_c and dca_c columns are combined into the spd column.
The Basic numeric dictionary in main.py is:
# Define the number dictionary for each numeric column
num_dict = {'fare_n': [50, 5, 100],
'tips_n': [20, 2, 50],
'seconds_n': [5000, 100, 10000],
'miles_n': [20, 2, 50]
}
For example, the fare_n column data is divided into 11 numeric bins: bins of 5 from 0-50 (10 bins of 5), with values over 50 falling in a single 50-100 bin (1 bin of 50).
# Check if the ground truth file is properly formatted
valid = privacy.check_input(ground_truth, combo_dict, num_dict)
The privacy.check_input function checks to see if the ground truth file is properly formatted and returns 1 if True.
# Preprocess the ground truth by combining the columns and creating decodings
df, num_decode, col_decode = privacy.preprocess(ground_truth, combo_dict, num_dict)
The returned preprocessed data frame is:
taxi_id_i spd cp fare
0 1004636 1007600 1271 19172510
1 1004636 1000808 1270 12101810
2 1004636 1000808 1270 11101310
3 1004636 1000806 1271 12111710
4 1004636 1000824 1271 12111510
.. ... ... ... ...
211 1004638 1192843 1270 11102810
212 1004638 1194346 1270 11101310
213 1004638 1195000 1270 11104110
214 1004638 1194250 1270 11102310
215 1004638 1195028 1270 11104010
[216 rows x 4 columns]
The returned numeric decoder dictionary is:
{'fare_n': {10: 0.0, 11: 5.0, 12: 10.0, 13: 15.0, 14: 20.0, 15: 25.0, 16: 30.0, 17: 35.0, 18: 40.0, 19: 45.0, 20: 50.0, 21: inf}, 'tips_n': {10: 0.0, 11: 2.0, 12: 4.0, 13: 6.0, 14: 8.0, 15: 10.0, 16: 12.0, 17: 14.0, 18: 16.0, 19: 18.0, 20: 20.0, 21: inf}, 'seconds_n': {10: 0.0, 11: 100.0, 12: 200.0, 13: 300.0, 14: 400.0, 15: 500.0, 16: 600.0, 17: 700.0, 18: 800.0, 19: 900.0, 20: 1000.0, 21: 1100.0, 22: 1200.0, 23: 1300.0, 24: 1400.0, 25: 1500.0, 26: 1600.0, 27: 1700.0, 28: 1800.0, 29: 1900.0, 30: 2000.0, 31: 2100.0, 32: 2200.0, 33: 2300.0, 34: 2400.0, 35: 2500.0, 36: 2600.0, 37: 2700.0, 38: 2800.0, 39: 2900.0, 40: 3000.0, 41: 3100.0, 42: 3200.0, 43: 3300.0, 44: 3400.0, 45: 3500.0, 46: 3600.0, 47: 3700.0, 48: 3800.0, 49: 3900.0, 50: 4000.0, 51: 4100.0, 52: 4200.0, 53: 4300.0, 54: 4400.0, 55: 4500.0, 56: 4600.0, 57: 4700.0, 58: 4800.0, 59: 4900.0, 60: 5000.0, 61: inf}, 'miles_n': {10: 0.0, 11: 2.0, 12: 4.0, 13: 6.0, 14: 8.0, 15: 10.0, 16: 12.0, 17: 14.0, 18: 16.0, 19: 18.0, 20: 20.0, 21: inf}}
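Comparing the first ground truth row (fare 44, tips 14, seconds 1440, miles 0) with its packed fare value 19172510 suggests that each raw value maps to the bin code 10 + ceil(value / increment), with the two-digit codes concatenated in column order. A sketch of that inferred encoding (an illustration, not the privacy.py source):

import math

# Inferred encoding, shown for illustration only: each raw numeric value
# appears to map to bin code 10 + ceil(value / increment)
def bin_code(value, increment):
    return 10 + math.ceil(value / increment)

# First ground truth row: fare 44, tips 14, seconds 1440, miles 0
codes = [bin_code(44, 5), bin_code(14, 2), bin_code(1440, 100), bin_code(0, 2)]
print(codes)  # [19, 17, 25, 10] -> concatenated as 19172510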
The returned column decoder dictionary is:
{'shift_c': ['spd', 1, 3], 'pca_c': ['spd', 3, 5], 'dca_c': ['spd', 5, 7], 'company_c': ['cp', 1, 3], 'payment_c': ['cp', 3, 4], 'fare_n': ['fare', 0, 2], 'tips_n': ['fare', 2, 4], 'seconds_n': ['fare', 4, 6], 'miles_n': ['fare', 6, 8]}
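Each col_decode entry names the combined column and the digit positions of that feature's code within the packed value; categorical combinations appear to carry a leading 1 as a pad that preserves leading zeros. Slicing the digits by hand (a sketch of what col_decoder automates):

# col_decode['shift_c'] == ['spd', 1, 3]: read digits 1..3 of the packed value
spd_value = 1130808  # the sampled spd value shown later in this example
print(int(str(spd_value)[1:3]))  # 13 -> shift_c
print(int(str(spd_value)[3:5]))  # 8  -> pca_c (digits '08')
print(int(str(spd_value)[5:7]))  # 8  -> dca_c (digits '08')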
# Print a count of the number of bins for each histogram
privacy.histo_test(df, combo_dict)
The histo test prints the following:
2021-08-10 09:51:16.368 | INFO | privacy:histo_test:294 - number of bins in spd 198
2021-08-10 09:51:16.368 | INFO | privacy:histo_test:294 - number of bins in cp 3
2021-08-10 09:51:16.369 | INFO | privacy:histo_test:294 - number of bins in fare 90
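The bin count for each combined column is the number of distinct packed values it contains, so the same numbers can be reproduced directly (assuming, as the output above suggests, that bins correspond to distinct values):

# Each bin count equals the number of distinct packed values in the column
for combo_name in combo_dict:
    print(combo_name, df[combo_name].nunique())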
The create_private_histo function creates the privatized histograms as lists of populations and weights.
# Create privatized histograms for the specified column using the sample size, sensitivity and epsilon
spd_population, spd_weights = privacy.create_private_histo(df, 'spd', sample, sample_size, sensitivity, epsilon)
cp_population, cp_weights = privacy.create_private_histo(df, 'cp', sample, sample_size, sensitivity, epsilon)
fare_population, fare_weights = privacy.create_private_histo(df, 'fare', sample, sample_size, sensitivity, epsilon)
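Conceptually, the returned weights are noisy bin counts. A simplified sketch of the idea (an illustration, not the privacy.py implementation):

import numpy as np

# Simplified sketch of a privatized histogram: count each distinct bin
# value, then add Laplace noise with scale sensitivity / epsilon to every
# count (the Laplace mechanism), keeping weights non-negative for sampling
def private_histo_sketch(values, sensitivity, epsilon):
    population, counts = np.unique(values, return_counts=True)
    noisy = counts + np.random.laplace(0.0, sensitivity / epsilon, len(counts))
    weights = np.clip(noisy, 0, None)
    return list(population), list(weights)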
Using the histograms (populations and weights), a random value for a combined column is selected and decoded.
from random import choices

# Select a random combined column value from the histogram
spd_value = choices(spd_population, spd_weights)
# Decode the column values from the combined column values
shift = privacy.col_decoder(num_dict, num_decode, col_decode, spd_value[0], 'shift_c')
pca = privacy.col_decoder(num_dict, num_decode, col_decode, spd_value[0], 'pca_c')
dca = privacy.col_decoder(num_dict, num_decode, col_decode, spd_value[0], 'dca_c')
The values of the combined column and decoded values are:
spd_value [1130808]
shift 13
pca 8
dca 8
The Detroit example is located in the examples directory in the detroit directory. This example demonstrates how to generate a basic synthetic dataset from a raw .csv dataset. The example uses the City of Detroit 2015 Fire Data.
The convert.py code converts the raw data to properly formatted ground truth data.
The goal of the conversion is to simplify the data as much as possible. In the case of the Detroit data there are 5 location columns (ADDRESS, ENGINE AREA, X, Y, and LOCATION). Only one location column is needed, so ENGINE AREA is kept and the other 4 are deleted. In addition, there are 8 date and time columns related to incident calls. The columns DATE OF CALL and TIME OF CALL contain the month, day, year, hour, minute, and second of the call and are reduced to call_month_c, call_day_c, and call_hour_c. The 6 fields related to the date and time of dispatch, arrival, and clearance of an incident call (DATE OF DISPATCH, TIME OF DISPATCH, DATE OF ARRIVAL, TIME OF ARRIVAL, DATE UNIT CLEARED, and TIME UNIT CLEARED) are reduced to minutes-from-call in dispatch_n, arrival_n, and clear_n. The INCIDENT TYPE and INCIDENT TYPE CATEGORY columns contain the same information and are reduced to incident_type_c. A synthetic incident_i column is created to replace the actual INCIDENT # column. The remaining columns match the raw data.
The main.py code is the main program containing the configuration and creation of the privatized histograms. It calls the simulate_row.py code to create a row of simulated data.
The main.py code begins with configuration parameters. There are a total of 4 combined histograms and 1 population query. The maximum number of records per individual incident is 1, which is used as the sample size. An epsilon value of 1.0 is used.
number_histos = 4 # Create 4 histograms
population_queries = 1 # Use one population query for number of incidents
sample = 0 # Do not use sampling
sample_size = 1 # Sample size is 1
epsilons = [1.0] # Use an epsilon value of 1.0
The columns are combined into 4 combined histograms: type, injury, call, and result. The general idea is to combine similar or correlated columns. Categorical and numeric features cannot be combined in the same histogram. In this example, all injury-related columns are combined (injury), all call-related categorical columns are combined (call), all call-related numeric columns are combined (result), and the remaining descriptive columns are combined (type).
combo_dict = {'type': ['engine_area_c', 'exposure_c', 'incident_type_c', 'property_use_c', 'detector_c', 'structure_stat_c'],
'injury': ['cinjury_c', 'cfatal_c', 'finjury_c', 'ffatal_c'],
'call': ['call_month_c', 'call_day_c', 'call_hour_c'],
'result': ['dispatch_n', 'arrival_n', 'clear_n']
}
For the numeric columns, the range of bin values needs to be defined. For dispatch_n and arrival_n, bins are created from 0 to 1000 in increments of 50, with the remaining values in a single bin from 1000 to 5000. For clear_n, bins are created from 0 to 5000 in increments of 50, with the remaining values in a single bin from 5000 to 10000.
num_dict = {'dispatch_n': [1000, 50, 5000],
'arrival_n': [1000, 50, 5000],
'clear_n': [5000, 50, 10000]
}
Once the configuration is complete, the main() program begins by loading the converted ground truth file and checking that it meets the basic formatting criteria: all values positive integers, no NaNs, and appropriately appended column names.
ground_truth = pd.read_csv(ground_truth_file)
valid = privacy.check_input(ground_truth, combo_dict, num_dict)
if valid != 1:
return
The ground truth is then preprocessed into a new combined column dataframe and numeric/column decoding information.
df, num_decodes, col_decodes = privacy.preprocess(ground_truth, combo_dict, num_dict)
privacy.histo_test(df, combo_dict)
A loop over the epsilon values is entered, but since there is only one epsilon the loop runs once in this example. The first step in the loop is to calculate the sensitivity, which equals the number of histograms times the sample size plus the number of population queries: (4 x 1) + 1 = 5. The Laplace Mechanism is then used to add noise to the total number of incidents in the ground truth, producing the number of incidents in the new privatized synthetic dataset. This number will vary each time the program is run.
sensitivity = (number_histos * sample_size) + population_queries
num_incidents = len(df)
num_incidents_noise = int(privacy.laplaceMechanism(num_incidents, sensitivity, epsilon))
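The Laplace Mechanism adds zero-centered Laplace noise with scale sensitivity / epsilon. A minimal sketch consistent with the call above (the signature is assumed; see privacy.py for the actual code):

import numpy as np

# Assumed behavior of privacy.laplaceMechanism (a sketch, not the source):
# noise with scale sensitivity / epsilon provides epsilon-differential
# privacy for a count query with the stated sensitivity
def laplace_mechanism(value, sensitivity, epsilon):
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)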
The 4 privatized histograms are then created and consist of a population list and a weight list.
type_pop, type_w = privacy.create_private_histo(df, 'type', sample, sample_size, sensitivity, epsilon)
injury_pop, injury_w = privacy.create_private_histo(df, 'injury', sample, sample_size, sensitivity, epsilon)
call_pop, call_w = privacy.create_private_histo(df, 'call', sample, sample_size, sensitivity, epsilon)
result_pop, result_w = privacy.create_private_histo(df, 'result', sample, sample_size, sensitivity, epsilon)
Finally, for each simulated incident, the histograms (population and weight lists) are randomly sampled: a value drawn according to the privatized weights is passed to simulate_row to be decoded into the original values.
for i in range(num_incidents_noise):
type_value = choices(type_pop, type_w, k=1)
injury_value = choices(injury_pop, injury_w, k=1)
call_value = choices(call_pop, call_w, k=1)
result_value = choices(result_pop, result_w, k=1)
row = simulate_row(i,
type_value[0],
injury_value[0],
call_value[0],
result_value[0],
num_dict,
num_decodes,
col_decodes
)
final_list.append(row)
The simulate_row.py code creates an individual row of simulated data.
The simulate_row function creates one simulated row by decoding the combined columns into their respective values.
row = {}
row['incident_i'] = incident + 1
row['engine_area_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, type, 'engine_area_c')
row['exposure_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, type, 'exposure_c')
row['incident_type_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, type, 'incident_type_c')
row['property_use_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, type, 'property_use_c')
row['detector_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, type, 'detector_c')
row['structure_stat_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, type, 'structure_stat_c')
row['cinjury_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, injury, 'cinjury_c')
row['cfatal_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, injury, 'cfatal_c')
row['finjury_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, injury, 'finjury_c')
row['ffatal_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, injury, 'ffatal_c')
row['call_month_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, call, 'call_month_c')
row['call_day_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, call, 'call_day_c')
row['call_hour_c'] = privacy.col_decoder(n_dict, n_decode, c_decode, call, 'call_hour_c')
row['dispatch_n'] = privacy.col_decoder(n_dict, n_decode, c_decode, result, 'dispatch_n')
row['arrival_n'] = privacy.col_decoder(n_dict, n_decode, c_decode, result, 'arrival_n')
row['clear_n'] = privacy.col_decoder(n_dict, n_decode, c_decode, result, 'clear_n')
The Detroit example creates privatized simulated data that closely resembles the ground truth data. The number of incidents is usually within about 0-15 of the ground truth total of almost 20,000 incidents. The distribution of values in each simulated column closely resembles the distribution in the corresponding ground truth column. The actual values of the categorical columns in the simulated data match the categorical values of the ground truth columns. While the numeric values in the simulated data don't match the numeric values in the ground truth data, they are in a reasonable range.
The area where the simulated data is weak is the linkage between combined columns. For example, for certain incident types (e.g., false alarms, unintentional calls) one would expect clearance times to be shorter than for other incident types (building fires, gasoline spills). To improve the accuracy of the simulated data, a link would need to be made between the combined column type (containing the incident_type_c column) and the combined column result (containing the clear_n column). An example of "linking" combined columns can be found in the Sprint 3 example. For the purposes of demonstration, however, this very simple example creates a reasonable set of simulated data from raw data using a relatively small amount of code.
The Sprint 3 example is located in the examples directory in the sprint3 directory. This example demonstrates the use of the privacy functions in Sprint 3 of the Differential Privacy Temporal Map Challenge. This link provides details on Sprint 3 as well as an overview of the approach used in the contest. The code listed below is well documented. Running the example can take several hours depending on your computer.
The convert_s3.py code converts the raw data to properly formatted data.
The main.py code is the main program - run python main.py to create a submission.csv file
The metric.py code calculates the metrics for the results - run python metric.py ground_truth.csv submission.csv
The privacy.py code contains the privacy functions - called from main.py
The simulate_row.py code creates a simulated row - called from main.py
The trips.py code contains trip related functions - called from main.py
The Sprint 2 example is located in the examples directory in the sprint2 directory. This example demonstrates the use of the privacy functions in Sprint 2 of the Differential Privacy Temporal Map Challenge. This link provides details on Sprint 2 as well as an overview of the approach used in the contest. The code listed below is well documented. Running the example can take several hours depending on your computer.
The convert_s2.py code converts the raw data to properly formatted data.
The main.py code is the main program - run python main.py to create a submission.csv file
The metric.py code calculates the metrics for the results - run python metric.py ground_truth.csv submission.csv
The privacy.py code contains the privacy functions - called from main.py
The simulate_row.py code creates a simulated row - called from main.py