Plan of Work:
For each county, restricted to the people in households (not group quarters, gq):
1. Load the individual person data synthesized for that county in the re-id exercise (df_synth).
2. Split df_synth into (a) simulated external data (df_sim) and (b) hold-out validation data (df_test).
3. Load the corresponding county of privacy-protected person data from the PL 94-171 demonstration product (df_ppmf).
4. Merge df_sim and df_ppmf on their common fields, e.g. tract, block, and voting_age.
5. Count how many individuals in df_sim have a unique match, and how many individuals in df_sim have a unique race or ethnicity in the matched PPMF data.
6. For the individuals in df_sim with a unique match, or with matches that all share a single race/ethnicity, check how often the linked value is the same as the value in the validation data in df_test.
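
The split, merge, and validation steps can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the code used in the exercise: the column name race_eth, the helper names, and the toy records are all hypothetical, and the split is assumed to be column-wise (df_sim keeps the linking fields, df_test keeps race/ethnicity for the same people, sharing the df_synth index).

```python
import pandas as pd

KEYS = ["tract", "block", "voting_age"]  # common fields named in the plan
ATTR = "race_eth"                        # hypothetical column name


def split_synth(df_synth):
    """Assumed column-wise split; both halves keep the df_synth index."""
    return df_synth[KEYS].copy(), df_synth[[ATTR]].copy()


def link(df_sim, df_ppmf):
    """Attach to each df_sim row a summary of the df_ppmf rows sharing its keys."""
    cells = df_ppmf.groupby(KEYS)[ATTR].agg(
        n_match="count",      # protected records in the same key cell
        n_attr="nunique",     # distinct race/ethnicity values among them
        linked_attr="first",  # the value itself, meaningful when n_attr == 1
    )
    linked = df_sim.join(cells, on=KEYS)  # left join, keeps df_sim's index
    linked["unique_match"] = linked["n_match"] == 1
    linked["unique_attr"] = linked["n_attr"] == 1
    return linked


def summarize(linked, df_test):
    """Count unique links and score them against the hold-out data."""
    mask = linked["unique_attr"]  # a unique match is also a unique attribute
    correct = linked.loc[mask, "linked_attr"] == df_test.loc[mask, ATTR]
    return {
        "n_unique_match": int(linked["unique_match"].sum()),
        "n_unique_attr": int(mask.sum()),
        "accuracy": float(correct.mean()),
    }


# Toy records (made up for illustration); voting_age: 1 = 18+, 0 = under 18.
df_synth = pd.DataFrame({
    "tract": [1, 1, 1, 1, 2, 2],
    "block": [1001, 1001, 1002, 1002, 2001, 2001],
    "voting_age": [1, 0, 1, 1, 1, 1],
    "race_eth": ["A", "B", "A", "A", "B", "C"],
})

# Stand-in for the protected file: a copy with one perturbed record.
df_ppmf = df_synth.copy()
df_ppmf.loc[1, ATTR] = "A"

df_sim, df_test = split_synth(df_synth)
summary = summarize(link(df_sim, df_ppmf), df_test)
print(summary)
```

Summarizing df_ppmf per key cell before joining keeps one row per person in df_sim, so the unique-match and unique-attribute counts can be read directly off boolean columns.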
Since the PPMF is based on the pre-swapped data, step (6) is perhaps not meaningful. If, instead of loading df_ppmf, I simulate it from the same df_synth data, I can make step (6) meaningful and also sweep across values of epsilon. I can then compare against the results at the epsilon found empirically from the PPMF.
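
One way to simulate df_ppmf from df_synth is sketched below. The mechanism is an assumption on my part, since the plan does not name one: Laplace noise with scale 1/epsilon is added to every cell of a histogram over the linking fields and race/ethnicity, the noisy counts are rounded and clipped at zero, and the cells are expanded back into person records. This is a simple noisy-histogram stand-in, not the algorithm behind the actual PPMF, and the column name race_eth, the function name, and the toy records are hypothetical.

```python
import numpy as np
import pandas as pd

COLS = ["tract", "block", "voting_age", "race_eth"]  # race_eth is hypothetical


def simulate_ppmf(df_synth, epsilon, cols, rng):
    """Noisy-histogram stand-in for a privacy-protected microdata file."""
    # Histogram over the full cross product of observed values, so that
    # empty cells also receive noise. (The product can include tract/block
    # pairs that do not exist; acceptable for a sketch.)
    levels = [sorted(df_synth[c].unique()) for c in cols]
    full_index = pd.MultiIndex.from_product(levels, names=cols)
    hist = df_synth.groupby(cols).size().reindex(full_index, fill_value=0)

    # Laplace noise with scale 1/epsilon, then round and clip to valid counts.
    noisy = hist.to_numpy() + rng.laplace(scale=1.0 / epsilon, size=len(hist))
    counts = np.clip(np.rint(noisy), 0, None).astype(int)

    # Expand each cell into that many identical person records.
    cells = full_index.to_frame(index=False)
    return cells.loc[cells.index.repeat(counts)].reset_index(drop=True)


# Toy records (made up for illustration); voting_age: 1 = 18+, 0 = under 18.
df_synth = pd.DataFrame({
    "tract": [1, 1, 1, 1, 2, 2],
    "block": [1001, 1001, 1002, 1002, 2001, 2001],
    "voting_age": [1, 0, 1, 1, 1, 1],
    "race_eth": ["A", "B", "A", "A", "B", "C"],
})

# Sweep across epsilon; each simulated file would then go through the
# merge and validation steps of the plan.
rng = np.random.default_rng(12345)
for epsilon in (0.1, 1.0, 10.0):
    df_ppmf = simulate_ppmf(df_synth, epsilon, COLS, rng)
    print(epsilon, len(df_ppmf))
```

Because the simulated file is derived from df_synth itself, the held-out values in df_test are the right ground truth for the linked records, which is what makes step (6) meaningful in this setup.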