search-match-age-estimation

Estimation code for search and matching model with aging

Estimation

Prepare ACS data

Use csv2sqlite.py to load the raw csv.gz data file into an SQLite database, one table per call:

  • acs: table from ACS microdata
    • python csv2sqlite.py --gzip acs_08-16.csv.gz acs_08-16.db acs
  • mig2met: table to convert migration state/puma to puma
    • run data-prep.r to build the csv from the two csv files mapping puma and msa
    • then load it into the sqlite database as another table
    • python csv2sqlite.py mig2met.csv acs_08-16.db mig2met

Then use SQL queries to get aggregated values, which avoids loading the entire dataset into memory. The queries apply categorizations (race, edu) on the fly, so there is no need to pre-clean the data.
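
A minimal sketch of that query pattern, using Python's built-in sqlite3 module. The column names (met, age, educ, perwt), the education cutoff, and the in-memory database are assumptions for illustration; the actual queries and ACS variable names used by the R scripts may differ.

```python
import sqlite3

# Hypothetical schema: column names (met, age, educ, perwt) are assumptions,
# and an in-memory database stands in for acs_08-16.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acs (met INTEGER, age INTEGER, educ INTEGER, perwt REAL)")
conn.executemany(
    "INSERT INTO acs VALUES (?, ?, ?, ?)",
    [
        (35620, 30, 6, 100.0),
        (35620, 30, 10, 50.0),
        (35620, 31, 11, 25.0),
        (31080, 30, 6, 80.0),
    ],
)

# CASE bins the raw education code into two categories inside the query,
# and GROUP BY returns only the aggregated cells, never the raw rows.
query = """
SELECT met, age,
       CASE WHEN educ >= 10 THEN 'college' ELSE 'no_college' END AS edu,
       SUM(perwt) AS pop
FROM acs
GROUP BY met, age, edu
ORDER BY met, age, edu
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the categorization lives in the query, changing a category definition only requires editing the CASE expression, not rebuilding the database.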

Estimation of model objects

  • set up data: check these files for correct filenames per model (different specifications by age and type)
  • smooth-pops.r: query and smooth aggregated population counts in each desired metro
    • Total/single/married populations, marriage/divorce flows, migration flows
    • Smoothing by non-parametric regression (local-polynomial): using hand-rolled "diagonal" smoothing kernel (manual bandwidth)
    • Saves smoothed data to csv for loading into julia
  • mort-rates.r: interpolates and saves death rates
  • main-estim.jl: main driver for the estimation; set its options before running
    • loads populations from saved JLD files, or calls prepare-pops.jl to generate them anew
    • prepare-pops.jl: loads csv files generated by R scripts above, then converts DataFrames to multidimensional arrays (per metro) and saves as JLD files
    • estimate arrival rates and then non-parametric objects using estim-functions.jl and compute-npobj.jl
    • can also do a parameter grid search or monte carlo estimation
  • plot-results.r: plot model-data fit and estimated objects
    • tikz-conversion.R: produce tikz figures from saved plot objects
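
For intuition about the smoothing step, here is a minimal sketch of local-polynomial regression with a manually chosen bandwidth. It is a generic one-dimensional local-linear smoother with a Gaussian kernel, not the hand-rolled "diagonal" kernel in smooth-pops.r; the function name, kernel choice, and sample counts are assumptions.

```python
import math

def local_linear(x, y, x0, bandwidth):
    """Local-linear estimate of E[y | x = x0] with a Gaussian kernel."""
    # Kernel weights decline with distance from the evaluation point x0.
    w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    # Weighted moments of the centered regressor (xi - x0).
    s0 = sum(w)
    s1 = sum(wi * (xi - x0) for wi, xi in zip(w, x))
    s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
    # The intercept of the weighted least-squares line is the fit at x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 ** 2)

# Made-up population counts by age, smoothed with a bandwidth of 2 years.
ages = list(range(20, 30))
counts = [105, 98, 110, 102, 95, 101, 93, 97, 90, 88]
smoothed = [local_linear(ages, counts, a, bandwidth=2.0) for a in ages]
```

A larger bandwidth averages over more neighboring ages and gives a smoother but more biased curve; a smaller one follows the raw counts more closely.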

Bootstrap Standard Errors

Run the following scripts in sequence: they set up the resampled datasets, run smoothing, and then run estimation. The shell scripts use GNU Parallel for efficient batch processing.

  1. Rscript bootstrap-resampler.r: creates directories data/bootstrap-samples/resamp_00 with resampled csv data
  2. bash bootstrap-create-db.sh: creates sqlite db from csv files
  3. bash bootstrap-smooth.sh: runs smooth-data.r for both ageonly and racedu specifications
    • Took 40 hours for 100 resamples on 8 cores, low memory usage (<2GB)
  4. bash bootstrap-cp-psi.sh: copies the death rate data into the smoothed populations directories for each resample
  5. bash bootstrap-estim.sh: runs main-estim.jl for both ageonly and racedu specifications
    • Took 100 minutes for 100 resamples on 8 cores, low memory usage (<4GB)
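
The resampling in step 1 can be sketched as follows. This is an illustration only: it assumes simple row-level resampling with replacement, and the function name, the file name acs.csv, and the sample rows are hypothetical. bootstrap-resampler.r may resample at a different level.

```python
import csv
import random
import tempfile
from pathlib import Path

def write_resamples(rows, header, out_root, n_resamples, seed=0):
    """Write n_resamples bootstrap draws of rows, one directory per draw."""
    rng = random.Random(seed)
    out_dirs = []
    for b in range(n_resamples):
        # Draw len(rows) rows with replacement from the original sample.
        draw = rng.choices(rows, k=len(rows))
        out_dir = Path(out_root) / f"resamp_{b:02d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical file name; one csv per resample directory.
        with open(out_dir / "acs.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(draw)
        out_dirs.append(out_dir)
    return out_dirs

header = ["met", "age", "perwt"]
rows = [(35620, 30, 100.0), (35620, 31, 50.0), (31080, 30, 80.0)]
with tempfile.TemporaryDirectory() as tmp:
    dirs = write_resamples(rows, header, tmp, n_resamples=3)
    names = [d.name for d in dirs]
    with open(dirs[0] / "acs.csv", newline="") as f:
        n_lines = sum(1 for _ in f)
```

Each resample directory is then processed independently by the later steps, which is what makes the batch easy to parallelize.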

Largest metro areas by adult population (millions)

  1. 35620: 14.5m - New York-Newark-Jersey City, NY-NJ-PA
  2. 31080: 9.4m - Los Angeles-Long Beach-Anaheim, CA
  3. 16980: 6.8m - Chicago-Naperville-Elgin, IL-IN-WI
  4. 19100: 4.6m - Dallas-Fort Worth-Arlington, TX
  5. 37980: 4.4m - Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
  6. 26420: 4.2m - Houston-The Woodlands-Sugar Land, TX
  7. 47900: 4.1m - Washington-Arlington-Alexandria, DC-VA-MD-WV
  8. 33100: 4.1m - Miami-Fort Lauderdale-West Palm Beach, FL
  9. 12060: 3.8m - Atlanta-Sandy Springs-Roswell, GA
  10. 14460: 3.5m - Boston-Cambridge-Newton, MA-NH
  11. 41860: 3.3m - San Francisco-Oakland-Hayward, CA
  12. 19820: 3.1m - Detroit-Warren-Dearborn, MI
  13. 38060: 3.1m - Phoenix-Mesa-Scottsdale, AZ
  14. 40140: 3.0m - Riverside-San Bernardino-Ontario, CA
  15. 42660: 2.6m - Seattle-Tacoma-Bellevue, WA
  16. 33460: 2.4m - Minneapolis-St. Paul-Bloomington, MN-WI
  17. 41740: 2.3m - San Diego-Carlsbad, CA
  18. 45300: 2.1m - Tampa-St. Petersburg-Clearwater, FL
  19. 41180: 2.0m - St. Louis, MO-IL
  20. 12580: 2.0m - Baltimore-Columbia-Towson, MD

30th metro is 1.4m, 40th is 0.9m.

(Deprecated) Estimation of rate parameters by OLS

  • estimate-rates-full.r and estimate-rates.r
    • very poor accuracy due to noisy inference on divorce flows
  • Need marriage and divorce rates for each couple-type (globally)
    • Marriage rate (directly observable): SQL queries for flows and stocks to compute rates
    • Divorce rate (infer from non-divorce rate and death rate)
  • Weighted OLS (by stocks of couples)