The overall framework of GenSyn is depicted here:
As shown, the framework consists of three main components: conditional probabilties modeling, gaussian copula sampling, and maximum entropy optimization. This repository contains an implementation of GenSyn and is based on both R and Python. R is used for implementing the conditional probabilities modeling algorithm and the gaussian copula sampling algorithm. And, python is used for maximum entropy optimization using the L-BFGS-B algorithm. The required dependencies and the corresponding source codes of these methods are detailed separately:
The conditional probabilities method is implemented on top of the SynthACS framework with necessary modifications. Therefore, the details on the functions implemented in the code can be found in [1]. We implement the conditional probabilties method for two different datasets (i) One that only consists of ACS variables - provided in r_src/conditional_probabilities_acs.R (ii) One that combines ACS variables with variables from other sources - provided in r_src/conditional_probabilities_combined.R.
- data.table
- synthACS
- dplyr
- acs
- purrr
- stringr
- hash
- jsonlite
The Gaussian copula sampling code is motivated by [2]. As with the conditional probabilties method, it consists of codes for two different datasets and are provided in r_src/copula_sampling_acs.R and r_src/copula_sampling_combined.R.
- mvtnorm
- fitdistrplus
- dplyr
The maximum entropy optimization is implemented in python. Here, python_src/maxentropy_all_counties.py and python_src/maxentropy_all_counties_acs.py represent the codes for optimization using maximum entropy for the two datasets. python_src/get_puma.py and python_src/get_puma_acs.py are implemented for obtaining corresponding pums sample data.
- statistics
- scipy
- matplotlib
- scipy
- dython
- numpy
- pypums
- pandas
- os
- copy
The demographic data is obtained from the American Community Survey (ACS) using the SythACS framework. Data on opioid overdose is obtained from CDC Wonder [3], data on insurance is obtained from Small Area Health Insurance Estimates (SAHIE) [4], and data on veteran population is obtained from Opioid Environment Policy Scan (OEPS) [5].
All the necessary input data are adjusted according to the required formats and are made available inside the "data" directory
[1] Whitworth A (Forthcoming). “synthACS: Spatial MicroSimulation Modeling with Synthetic American Community Survey Data.” Journal of Statistical Software.
[2] Wan, C., Li, Z., Guo, A., and Zhao, Y., "SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula," Workshops at the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. Accepted, to appear.
[3] Wide-ranging online data for epidemiologic research (WONDER). Atlanta, GA: CDC, National Center for Health Statistics; 2020. Available at http://wonder.cdc.gov
[4] US Census Bureau. (2021, October 8). 2008 - 2019 Small Area Health Insurance Estimates (SAHIE) using the American Community Survey (ACS) [Dataset]. https://www.census.gov/data/datasets/time-series/demo/sahie/estimates-acs.html
[5] Marynia Kolak, Qinyun Lin, Susan Paykin, Moksha Menghaney, & Angela Li. (2021, May 11). GeoDaCenter/opioid-policy-scan: Opioid Environment Policy Scan Data Warehouse (Version v0.1-beta). Zenodo. http://doi.org/10.5281/zenodo.4747876